So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘visualization’ Category

Path Fingerprints and Hash Quality

with 5 comments

Recently, on an email thread I was involved in, Egon mentioned that the CDK hashed fingerprints were probably being penalized by the poor hashing provided by Java’s hashCode method. Essentially, he suspected that the collision rate was high and so that the many bits were being set multiple times by different paths and that a fraction of bits were not being touched.

Recall that the CDK hashed fingerprint determines all topologically unique paths upto a certain length and stores them as strings (composed of atom & bond symbols). Each path is then converted to an int via the hashCode method and this int value is used to seed the Java random number generator. Using this generator a random integer value is obtained which is used as the position in the bit string which will be set to 1 for that specific path..

A quick modification to the CDK Fingerprinter code allowed me to dump out the number of times each position in the bitstring was being set, during the calculation of the fingerprint for a single molecule. Plotting the number of hits at each position allows us to visualize the effectiveness of the hashing mechanism. Given that the path strings being hashed are unique, a collision implies that two different paths are being hashed to the same bit position.

The figure alongside summarizes this for the CDK 1024-bit hashed fingerprints on 9 arbitrary molecules. The x-axis represents the bit position and the y-axis on each plot represents the number of times a given position is set to 1 during the calculation. All plots are on the same scale, so we can compare the different molecules (though size effects are not taken into account).

Visually, it appears that the bit positions being set are uniform randomly distributed throughout the length of the fingerprint. However, the number of collisions observed is non-trvial. While for most cases, there doesn’t seem to be a significant number of collisions, the substituted benzoic acid does have a number of bits that are set 4 times and many bits with 2 or more collisions.

The sparsity of triphenyl phosphine can be ascribed to the symmetry of the molecule and the consequent smaller number of unique paths being hashed. However it’s interesting to note that even in such a case, two bit positions see a collision and suggests that the hash function being employed is not that great.

This is a quick hack to get some evidence of hash function quality and its effect on hashed fingerprints. The immediate next step is to look at alternative hash functions. There are also other aspects of the measurement & visualization process that could be tweaked – taking into account molecular size, the actual number of unique paths and converting the plots shown here to some concise numeric representation, allowing us to summarize larger datasets in a single view.

Update – I just realized that the hash function is not the only factor here. The Java random number generator plays an important role. A quick test with the MD5 hash function indicates that we still see collisions (actually, more so than with hashCode), suggesting that the problem may be with how the RNG is being seeded (and the fact that only 48 bits of the seed are used).

Written by Rajarshi Guha

October 2nd, 2010 at 5:09 pm

Some More Comparisons with the GSK Dataset

without comments

My previous post did a quick comparison of the GSK anti-malarial screening dataset with a virtual library of Ugi products. That comparison was based on the PubChem fingerprints and indicated a broad degree of overlap. I was also interested in looking at the overlap in other feature spaces. The simplest way to do this is to evaluate a set of descriptors and then perform a principal components analysis. We can then plot the first two principal components to get an idea of the distribution of the compounds in the defined space.

I evaluated a number of descriptors using the CDK. In a physicochemical space represented by the number of rotatable bonds, molecular weight and XlogP values, a plot of the first two principal components looks as shown on the right. Given the large number of points, the plot is more of a blob, but does highlight the fact that there is a good degree of overlap between the two datasets. On going to a BCUT space on the left, we get a different picture, stressing the greater diversity of the GSK dataset. Of course, these are arbitary descriptor spaces and not necessarily meaningful. One would probably choose a descriptor space based on the problem at hand (and also the CDK XlogP implementation probably needs some work).

I was also interested in the promiscuity of the compounds in the GSK dataset. Promiscuity is the phenomenon where a molecule shows activity in multiple assays. Promiscuous activity could be indicate that the compound is truly active in all or most of the assays (i.e., hitting multiple distinct targets), but could also indicate that the activity is artifactual (such as if it were an aggregator or flourescent compound).

This analysis is performed by looking for those GSK molecules that are in the NCGC collection (272 exact matches) and checking to see how many NCGC assays they are tested in and whether they were active or not. Rather than look at all assays in the NCGC collection, I consider a subset of approximately 1300 assays curated by a colleague. Ideally, a compound will be active in only one (or a few) of the assays it is tested in.

For simplicities sake, I just plot the number of assays a compound is tested in versus the number of them that it is active in. The plot is colored by the activity (pXC50 value in the GSK SD file) so that more potent molecules are lighter. While the bulk of these molecules do not show significant promiscuous activity, a few of them do lie at the upper range. I’ve annotated four and their structures are shown below. Compound 530674 appears to be quite promiscuous given that it is active in 46 out of 84 assays it’s been tested in at the NCGC. On the other hand, 22942 is tested in 232 assays but is activity in 78 of them. This could be considered a low ratio, and isoquinolines have been noted to be non-promiscuous. (Both of these target kinases as noted in Gamo et al).

 

 

Written by Rajarshi Guha

May 24th, 2010 at 2:47 am

A Quick Look at the GSK Malaria Dataset

with 5 comments

A few days ago, GSK released an approximately 13,000 member compound library (using the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL and a paper is available, that describes the screening project and results. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.

There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.

I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-member rings. Another region of difference occurs around bit position 300 and then around positions 350-375.

The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.

Written by Rajarshi Guha

May 23rd, 2010 at 1:02 pm

When is a Bad Plate Bad?

without comments

When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be ervy good. That is, the data will be of poor quality due to a variety of reasons. Usually we can evaluate various statistical quality metrics to asses which plates are good and which ones need to be redone. A common metric is the Z-factor which uses the positive and negative control wells. The problem is, that if one or two wells have a problem (say, no signal in the negative control) then the Z-factor will be very poor. Yet, the plate could be used if we just mask those bad wells.

Now, for our current screens (100 plates) manual inspection is boring but doable. As we move to genome-wide screens we need a better way to identify truly bad plates from plates that could be used. One approach is to move to other metrics – SSMD (defined here and applications to quality control discussed here) is regarded as more effective than Z-factor – and in fact it’s advisable to look at multiple metrics rather than depend on any single one.

An alternative trick is to compare the Z-factor for a given plate to the trimmed Z-factor, which is evaluated using the trimmed mean and standard deviations. In our set up we trim 10% of the positive and negative control wells. For a plate that appears to be poor, due to one or two bad control wells, the trimmed Z-factor should be significantly higher than the original Z-factor. But for a plate in which, say the negative control wells all show poor signal, there should not be much of a difference between the two values. The analysis can be rapidly performed using a plot of the two values, as shown below. Given such a plot, we’d probably consider plates whose trimmed Z-factor are less than 0.5  and close to the diagonal. (Though for RNAi screens, Z’ = 0.5 might be too stringent).

From the figure below, just looking at Z-factor would have suggested 4 or 5 plates to redo. But when compared to the trimmed Z-factor, this comes down to a single plate. Of course, we’d look at other statistics as well, but it is a quick way to rapidly identify plates with truly poor Z-factors.

A plot of Z-factor versus trimmed Z-factor for a set of 100 plates

A plot of Z-factor versus trimmed Z-factor for a set of 100 plates

Written by Rajarshi Guha

January 29th, 2010 at 5:47 pm

Plate Well Series Plots in R

with 2 comments

Plate well series plots are a common way to summarize well level data across multiple plates in a high throughput screen. An example can be seen in Zhang et al. As I’ve been working with RNAi screens, this visualization has been a useful way to summarize screening data and the various transformations on that data. It’s fundamentally a simple scatter plot, with some extra annotations. Though the x-axis is labeled with plate number, the values on the x-axis are actually well locations. The y-axis is usually the signal from that well.

Since I use it often, here’s some code that will generate such a plot. The input is a list of matrices or data.frames, where each matrix or data.frame represents a plate. In addition you need to specify a “plate map” – a character matrix indicating whether a well is a sample, (“c”) positive control (“p”), negative control (“n”) or ignored (“x”). The code looks like

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
plate.well.series <- function(plate.list, plate.map, draw.sep = TRUE, color=TRUE, ...) {
  signals <- unlist(lapply(plate.list, as.numeric))
  nwell <- prod(dim(plate.list[[1]]))
  nplate <- length(signals) / nwell

  cols <- 'black'
  if (color) {
    pcolor <- 'red'
    ncolor <- 'green'
    colormat <-  matrix(0, nrow=nrow(plate.list[[1]]), ncol=ncol(plate.list[[1]]))
    colormat[which(plate.map == 'n')] <- ncolor
    colormat[which(plate.map == 'p')] <- pcolor
    colormat[which(plate.map == 'c')] <-  'black'
    cols <- sapply(1:nwell, function(x) {
      as.character(colormat)
    })
  }
  plot(signals, xaxt='n', ylab='Signal', xlab='Plate Number', col = cols, ...)
  if (color) legend('topleft', bty='n', fill=c(ncolor, pcolor, 'black'),
                    legend=c('Negative', 'Positive', 'Sample'),
                    y.intersp=1.2)
  if (draw.sep) {
    for (i in seq(1, length(signals)+nwell, by=nwell)) abline(v=i, col='grey')
  }
  axis(side=1, at = seq(1, length(signals), by=nwell) + (nwell/2), labels=1:nplate)
}

An example of such a plot is below


Plate Well Series Plot

Plate well series plot


Another example comparing normalized data from three runs of an RNAi screen investigating drug sensitization (also highlighting the fact that plate 7 in the 5nm run was messed up):


Comparing runs with plate well series plots

Comparing runs with plate well series plots


Written by Rajarshi Guha

July 14th, 2009 at 2:01 am