# So much to do, so little time

Trying to squeeze sense out of chemical data

## Visual pairwise comparison of distributions

While analysing some data from a dose response screen run across multiple cell lines, I needed to visualize summarized curve data in a pairwise fashion. Specifically, I wanted to compare area under the curve (AUC) values for the curve fits for the same compound between every pair of cell lines. Given that an AUC needs a proper curve fit, the number of non-NA AUCs is different for each cell line. As a result, making a scatter plot matrix (via plotmatrix) won’t do.

A more useful approach is to generate a matrix of density plots, such that each plot contains the distributions of AUCs from a pair of cell lines overlaid on each other. It turns out that some data.frame wrangling and facet_grid make this extremely easy. To start, we generate some test data with missing values.

    library(ggplot2)
    library(reshape)
    tmp1 <- data.frame(do.call(cbind, lapply(1:5, function(x) {
      r <- rnorm(100, mean=sample(1:4, 1))
      r[sample(1:100, 20)] <- NA
      return(r)
    })))

Next, we need to expand this into a form that lets us facet by pairs of variables

    tmp2 <- do.call(rbind, lapply(1:5, function(i) {
      do.call(rbind, lapply(1:5, function(j) {
        r <- rbind(data.frame(var='D1', val=tmp1[,i]),
                   data.frame(var='D2', val=tmp1[,j]))
        r <- data.frame(xx=names(tmp1)[i], yy=names(tmp1)[j], r)
        return(r)
      }))
    }))

Finally, we can make the plot

    ggplot(tmp2, aes(x=val, fill=var))+
      geom_density(alpha=0.2, position="identity")+
      theme(legend.position = "none")+
      facet_grid(xx ~ yy, scales='fixed')

Giving us the plot below.

I had initially asked this on StackOverflow, where Arun provided a more elegant approach to composing the data.frame.
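For reference, the long data.frame can also be composed by enumerating the column pairs with expand.grid instead of nesting two lapply calls. This is only a sketch of one possible alternative and is not necessarily the approach Arun suggested; the name `pairs` is introduced here for illustration.

```r
set.seed(1)
# same kind of test data as above: 5 columns, 20 NAs in each
tmp1 <- data.frame(do.call(cbind, lapply(1:5, function(x) {
  r <- rnorm(100, mean = sample(1:4, 1))
  r[sample(1:100, 20)] <- NA
  r
})))

# every ordered pair of column names (hypothetical helper table)
pairs <- expand.grid(xx = names(tmp1), yy = names(tmp1),
                     stringsAsFactors = FALSE)

# one block of 2 * nrow(tmp1) rows per pair, stacked into a long data.frame
tmp2 <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(k) {
  xx <- pairs$xx[k]
  yy <- pairs$yy[k]
  data.frame(xx = xx, yy = yy,
             var = rep(c("D1", "D2"), each = nrow(tmp1)),
             val = c(tmp1[[xx]], tmp1[[yy]]))
}))
```

The result has the same columns (xx, yy, var, val) as the tmp2 built earlier, so it can be passed to the same ggplot call.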

Written by Rajarshi Guha

February 10th, 2013 at 3:03 pm

## Path Fingerprints and Hash Quality

Recently, on an email thread I was involved in, Egon mentioned that the CDK hashed fingerprints were probably being penalized by the poor hashing provided by Java’s hashCode method. Essentially, he suspected that the collision rate was high, so that many bits were being set multiple times by different paths while a fraction of bits were not being touched at all.

Recall that the CDK hashed fingerprint determines all topologically unique paths up to a certain length and stores them as strings (composed of atom & bond symbols). Each path is then converted to an int via the hashCode method and this int value is used to seed the Java random number generator. Using this generator a random integer value is obtained, which is used as the position in the bit string that will be set to 1 for that specific path.
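The path-to-bit mapping described above can be sketched in R. This is an illustration under stated assumptions, not the CDK code: `java_hash_code` re-implements the documented String.hashCode recurrence for ASCII strings, R's own generator (via set.seed and sample.int) stands in for the Java random number generator, and the path strings are made up. The resulting positions will therefore not match those CDK produces.

```r
# Java's String.hashCode: h = 31 * h + c over the characters,
# wrapped to a signed 32-bit integer (ASCII input assumed)
java_hash_code <- function(s) {
  h <- 0
  for (ch in utf8ToInt(s)) {
    h <- (31 * h + ch) %% 2^32   # doubles are exact at this size
  }
  if (h >= 2^31) h - 2^32 else h
}

# hash -> seed -> one random draw -> bit position
# (the single hash value -2^31 is not representable as an R integer
#  and is not handled in this sketch)
path_to_bit <- function(path, nbits = 1024) {
  set.seed(as.integer(java_hash_code(path)))  # stand-in for seeding the Java RNG
  sample.int(nbits, 1)                        # 1-based position in the bit string
}

paths <- c("C", "C-C", "C-C=O", "C-O", "C:C:C")  # hypothetical path strings
positions <- vapply(paths, path_to_bit, integer(1))

# number of times each bit position is set for this "molecule"
hits <- tabulate(positions, nbins = 1024)
```

Any entry of `hits` greater than 1 corresponds to two distinct paths landing on the same bit position.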

A quick modification to the CDK Fingerprinter code allowed me to dump out the number of times each position in the bitstring was being set, during the calculation of the fingerprint for a single molecule. Plotting the number of hits at each position allows us to visualize the effectiveness of the hashing mechanism. Given that the path strings being hashed are unique, a collision implies that two different paths are being hashed to the same bit position.

The figure alongside summarizes this for the CDK 1024-bit hashed fingerprints on 9 arbitrary molecules. The x-axis represents the bit position and the y-axis on each plot represents the number of times a given position is set to 1 during the calculation. All plots are on the same scale, so we can compare the different molecules (though size effects are not taken into account).

Visually, it appears that the bit positions being set are uniformly and randomly distributed throughout the length of the fingerprint. However, the number of collisions observed is non-trivial. While for most cases there doesn’t seem to be a significant number of collisions, the substituted benzoic acid does have a number of bits that are set 4 times and many bits with 2 or more collisions.

The sparsity of triphenyl phosphine can be ascribed to the symmetry of the molecule and the consequent smaller number of unique paths being hashed. However, it’s interesting to note that even in such a case two bit positions see a collision, which suggests that the hash function being employed is not that great.

This is a quick hack to get some evidence of hash function quality and its effect on hashed fingerprints. The immediate next step is to look at alternative hash functions. There are also other aspects of the measurement & visualization process that could be tweaked – taking into account molecular size, the actual number of unique paths and converting the plots shown here to some concise numeric representation, allowing us to summarize larger datasets in a single view.
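As a rough sketch of what such a numeric summary might look like, the function below reduces a vector of per-position hit counts to a few numbers and adds the number of distinct bits one would expect to be set if the same number of paths were spread uniformly at random, which is m(1 - (1 - 1/m)^n) for n paths and m bits. The function name and the simulated hit counts are assumptions for illustration; they are not data from the molecules discussed here.

```r
# hits[i] = number of times bit position i was set for one molecule
summarize_hits <- function(hits) {
  npaths <- sum(hits)
  nbits <- length(hits)
  c(paths = npaths,
    bits_set = sum(hits > 0),          # distinct positions touched
    collided_bits = sum(hits > 1),     # positions hit by 2 or more paths
    max_hits = max(hits),
    # expected distinct positions under ideal uniform placement
    expected_bits_set = nbits * (1 - (1 - 1 / nbits)^npaths))
}

# stand-in data: 150 paths placed uniformly at random into 1024 positions
set.seed(42)
hits <- tabulate(sample.int(1024, 150, replace = TRUE), nbins = 1024)
res <- summarize_hits(hits)
```

Comparing bits_set with expected_bits_set gives a baseline: some collisions are expected even with ideal uniform placement, so the interesting question is whether a real fingerprint shows noticeably more than that.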

Update – I just realized that the hash function is not the only factor here. The Java random number generator plays an important role. A quick test with the MD5 hash function indicates that we still see collisions (actually, more so than with hashCode), suggesting that the problem may be with how the RNG is being seeded (and the fact that only 48 bits of the seed are used).

Written by Rajarshi Guha

October 2nd, 2010 at 5:09 pm

## 2D Depictions in R Plots

In preparation for the upcoming R workshop at the EBI, I’ve been cleaning up the rcdk package and updating some features. One of the new features is the ability to get a 2D depiction as a raster image. Until now, 2D depictions were drawn in a Swing window, which allowed you to resize the window but not much else. You really couldn’t use it for anything but viewing.

However, R-2.11.0 provides a new function called rasterImage, which overlays a raster image onto a pre-existing plot. It turns out that the png package lets me easily create such a raster image, either from a PNG file or from a vector of bytes. Given a molecule, we can get the byte array of its PNG representation via the view.image.2d function in the latest rcdk. As a result, you can now make a plot and then overlay a 2D depiction within the plot area. For example, to get the picture shown alongside, we could do:

    library(rcdk)
    m <- parse.smiles("C1CC2CC1C(=O)NC2")
    img <- view.image.2d(m, 200,200)
    plot(1:10, pch=19)
    rasterImage(img, 2,6, 6,10)
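The four numbers passed to rasterImage are the left, bottom, right and top edges of the image in plot coordinates. The sketch below shows the same placement using a synthetic gray gradient as a stand-in for the molecule image, so that it runs with base R alone (no rcdk or Java); the gradient is purely illustrative.

```r
# stand-in image: a 20 x 20 matrix of gray levels in [0, 1]
img <- as.raster(matrix(seq(0, 1, length.out = 400), nrow = 20))

plot(1:10, pch = 19)
# place the image in the upper-left part of the plot region
rasterImage(img, xleft = 2, ybottom = 6, xright = 6, ytop = 10)
```

Changing the four edge values moves or stretches the image, since it is scaled to fill the rectangle they define.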

The latest versions of rcdk and rpubchem are not on CRAN yet, but you can get source packages for OS X & Linux and binary packages for Windows at http://rguha.net/rcdk. Note that the latest version of rcdk requires R-2.11.0 along with rJava, rcdklibs, fingerprint and png as dependencies. If you’re interested in contributing, check out the git repository.

Written by Rajarshi Guha

May 3rd, 2010 at 9:22 pm

## Plate Well Series Plots in R

Plate well series plots are a common way to summarize well-level data across multiple plates in a high-throughput screen. An example can be seen in Zhang et al. As I’ve been working with RNAi screens, this visualization has been a useful way to summarize screening data and the various transformations on that data. It’s fundamentally a simple scatter plot with some extra annotations. Though the x-axis is labeled with plate number, the values on the x-axis are actually well locations. The y-axis is usually the signal from that well.

Since I use it often, here’s some code that will generate such a plot. The input is a list of matrices or data.frames, where each matrix or data.frame represents a plate. In addition you need to specify a “plate map”, a character matrix indicating whether a well is a sample (“c”), positive control (“p”), negative control (“n”) or ignored (“x”). The code looks like this:

    plate.well.series <- function(plate.list, plate.map, draw.sep = TRUE, color=TRUE, ...) {
      ## flatten each plate column by column; as.matrix lets data.frames work as well
      signals <- unlist(lapply(plate.list, function(p) as.numeric(as.matrix(p))))
      nwell <- prod(dim(plate.list[[1]]))
      nplate <- length(signals) / nwell
      cols <- 'black'
      if (color) {
        pcolor <- 'red'
        ncolor <- 'green'
        ## wells not marked 'n', 'p' or 'c' (e.g. ignored 'x' wells) keep the initial value 0
        colormat <- matrix(0, nrow=nrow(plate.list[[1]]), ncol=ncol(plate.list[[1]]))
        colormat[which(plate.map == 'n')] <- ncolor
        colormat[which(plate.map == 'p')] <- pcolor
        colormat[which(plate.map == 'c')] <- 'black'
        ## repeat the per-well colors once for each plate
        cols <- rep(as.character(colormat), nplate)
      }
      plot(signals, xaxt='n', ylab='Signal', xlab='Plate Number', col = cols, ...)
      if (color) legend('topleft', bty='n', fill=c(ncolor, pcolor, 'black'),
                        legend=c('Negative', 'Positive', 'Sample'),
                        y.intersp=1.2)
      if (draw.sep) {
        for (i in seq(1, length(signals)+nwell, by=nwell)) abline(v=i, col='grey')
      }
      axis(side=1, at = seq(1, length(signals), by=nwell) + (nwell/2), labels=1:nplate)
    }
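To try the function out, something along the following lines could be used to build suitable inputs. Everything here is made up for illustration: a 96-well (8 x 12) layout, negative controls in the first column, positive controls in the last column, one ignored well, and simulated signals for three plates.

```r
set.seed(10)
n.row <- 8
n.col <- 12   # assumed 96-well layout

# plate map: samples everywhere, controls in the outer columns
plate.map <- matrix('c', nrow = n.row, ncol = n.col)
plate.map[, 1] <- 'n'       # negative controls
plate.map[, n.col] <- 'p'   # positive controls
plate.map[1, 2] <- 'x'      # one ignored well

# three simulated plates with shifted control signals
plates <- lapply(1:3, function(i) {
  m <- matrix(rnorm(n.row * n.col, mean = 100, sd = 10), nrow = n.row)
  m[plate.map == 'n'] <- m[plate.map == 'n'] + 50
  m[plate.map == 'p'] <- m[plate.map == 'p'] - 50
  m
})

# draw the plot if the function above has been defined in this session
if (exists("plate.well.series")) plate.well.series(plates, plate.map)
```

Because each plate is flattened column by column, the wells of one plate occupy a contiguous block of 96 positions on the x-axis, with the control columns at either end of each block.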

An example of such a plot is below

Plate well series plot

Another example comparing normalized data from three runs of an RNAi screen investigating drug sensitization (also highlighting the fact that plate 7 in the 5nm run was messed up):

Comparing runs with plate well series plots

Written by Rajarshi Guha

July 14th, 2009 at 2:01 am