# So much to do, so little time

Trying to squeeze sense out of chemical data

## A Quick Look at the GSK Malaria Dataset

A few days ago, GSK released an approximately 13,000 member compound library (using the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL and a paper is available, that describes the screening project and results. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.

There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.

I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-member rings. Another region of difference occurs around bit position 300 and then around positions 350-375.

The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.

Written by Rajarshi Guha

May 23rd, 2010 at 1:02 pm

## RNAi in PubChem

While considering ways to disseminate RNAi screening data, I found out that PubChem now contains two RNAi screening datasets – AIDs 1622 and 1904. These screens reuse the PubChem bioaassay formats – which is both good and bad. For example, while there are a few standardized columns (such as PUBCHEM_ACTIVITY_SCORE), the bulk of the user deposited columns are not formally defined. In other words, you’d have to read the assay description. While not a huge deal, it would be nice if we could use pre-existing formats such as MIARE, analogous to MIAME for microarray data. That way we could determine the number of replicates, normalization method employed and other details of the screen. As far as I can tell all aspects an RNAi screen are still not fully defined in the MIAME vocabulary, and there don’t seem to be a whole lot of examples. But it’s a start.

But of course, nothing is perfect. Why, oh why, would a tab delimited format be contained within multiple worksheets of an Excel workbook!

Written by Rajarshi Guha

April 19th, 2010 at 12:19 am

Posted in bioinformatics

Tagged with , ,

When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be ervy good. That is, the data will be of poor quality due to a variety of reasons. Usually we can evaluate various statistical quality metrics to asses which plates are good and which ones need to be redone. A common metric is the Z-factor which uses the positive and negative control wells. The problem is, that if one or two wells have a problem (say, no signal in the negative control) then the Z-factor will be very poor. Yet, the plate could be used if we just mask those bad wells.

Now, for our current screens (100 plates) manual inspection is boring but doable. As we move to genome-wide screens we need a better way to identify truly bad plates from plates that could be used. One approach is to move to other metrics – SSMD (defined here and applications to quality control discussed here) is regarded as more effective than Z-factor – and in fact it’s advisable to look at multiple metrics rather than depend on any single one.

An alternative trick is to compare the Z-factor for a given plate to the trimmed Z-factor, which is evaluated using the trimmed mean and standard deviations. In our set up we trim 10% of the positive and negative control wells. For a plate that appears to be poor, due to one or two bad control wells, the trimmed Z-factor should be significantly higher than the original Z-factor. But for a plate in which, say the negative control wells all show poor signal, there should not be much of a difference between the two values. The analysis can be rapidly performed using a plot of the two values, as shown below. Given such a plot, we’d probably consider plates whose trimmed Z-factor are less than 0.5  and close to the diagonal. (Though for RNAi screens, Z’ = 0.5 might be too stringent).

From the figure below, just looking at Z-factor would have suggested 4 or 5 plates to redo. But when compared to the trimmed Z-factor, this comes down to a single plate. Of course, we’d look at other statistics as well, but it is a quick way to rapidly identify plates with truly poor Z-factors.

A plot of Z-factor versus trimmed Z-factor for a set of 100 plates

Written by Rajarshi Guha

January 29th, 2010 at 5:47 pm

## Plate Well Series Plots in R

Plate well series plots are a common way to summarize well level data across multiple plates in a high throughput screen. An example can be seen in Zhang et al. As I’ve been working with RNAi screens, this visualization has been a useful way to summarize screening data and the various transformations on that data. It’s fundamentally a simple scatter plot, with some extra annotations. Though the x-axis is labeled with plate number, the values on the x-axis are actually well locations. The y-axis is usually the signal from that well.

Since I use it often, here’s some code that will generate such a plot. The input is a list of matrices or data.frames, where each matrix or data.frame represents a plate. In addition you need to specify a “plate map” – a character matrix indicating whether a well is a sample, (“c”) positive control (“p”), negative control (“n”) or ignored (“x”). The code looks like

 1234567891011121314151617181920212223242526 plate.well.series <- function(plate.list, plate.map, draw.sep = TRUE, color=TRUE, ...) {   signals <- unlist(lapply(plate.list, as.numeric))   nwell <- prod(dim(plate.list[[1]]))   nplate <- length(signals) / nwell   cols <- 'black'   if (color) {     pcolor <- 'red'     ncolor <- 'green'     colormat <-  matrix(0, nrow=nrow(plate.list[[1]]), ncol=ncol(plate.list[[1]]))     colormat[which(plate.map == 'n')] <- ncolor     colormat[which(plate.map == 'p')] <- pcolor     colormat[which(plate.map == 'c')] <-  'black'     cols <- sapply(1:nwell, function(x) {       as.character(colormat)     })   }   plot(signals, xaxt='n', ylab='Signal', xlab='Plate Number', col = cols, ...)   if (color) legend('topleft', bty='n', fill=c(ncolor, pcolor, 'black'),                     legend=c('Negative', 'Positive', 'Sample'),                     y.intersp=1.2)   if (draw.sep) {     for (i in seq(1, length(signals)+nwell, by=nwell)) abline(v=i, col='grey')   }   axis(side=1, at = seq(1, length(signals), by=nwell) + (nwell/2), labels=1:nplate) }

An example of such a plot is below

Plate well series plot

Another example comparing normalized data from three runs of an RNAi screen investigating drug sensitization (also highlighting the fact that plate 7 in the 5nm run was messed up):

Comparing runs with plate well series plots

Written by Rajarshi Guha

July 14th, 2009 at 2:01 am

## Hit Selection Methods for RNAi Screens

Over the last few weeks I’ve been getting up to speed on RNAi screening (see Lord et al for a nice overview) and one of the key features in this approach is to reliably select hits for followup. In many ways it is similar to hit selection in small molecule HTS but also employs techniques from the analysis of microarray data.

Very broadly, there are two types of selection methods: statistical and knowledge based. In general, one employs a statistical method first, possibly followed by triaging the initial list using a knowledge based method. As with small molecule HTS, true hits (true positives or TP) in general are the outliers, though it’s possible that assay artifacts will also look like true hits and these would be false positives (FP). But in general, the outliers are easy and the real challenge is with identifying moderately active siRNA’s (or compounds) as actual hits.

One can further divide statistical methods into ranking procedures or thresholding procedures. In the former one ranks the siRNA‘s by some metric and selects as many as can be followed up using available resources. In the latter, one identifies a threshold, above which any siRNA is regarded as a hit. One of the key features of many statistical methods is that they provide bounds on the FP and FN (false negative) rates. This is especially important with thresholding methods.

But what about siRNA‘s that lie just below the threshold? Should one ignore them? Is it possible to further investigate whether they might actually be hits? This is where a knowledge based approach can provide a second level of confirmation for false negatives or positives. This post summarizes recent work on hit selection in RNAi screens.

### Statistical methods

Many statistical methods are similar to or derived from approaches employed in HTS experiments for small molecules. An excellent overview of statistical methods in HTS analysis is given by Malo et al and covers both assay quality metrics as well as hit selection methods.

One of the most common methods is to use a threshold of $$\mu \pm k \sigma$$, where $$\mu$$ and $$\sigma$$ are the mean and standard deviation of the sample wells respectively and k is usually taken to be 3. This is also known as the z-score method, and assumes a normal distribution of signal values. However, in RNAi screens it’s not uncommon to have a number of extreme outliers. Such a situation would skew the mean based threshold to larger values, leading to fewer prospective hits being identified.

Instead, Zhang et al (as well as Chung et al) suggest the use of the $$median \pm 3 \textrm{MAD}$$ method, where MAD is the mean absolute deviation and is given by

$$\textrm{MAD} = 1.4826\, median( | x_i – median(x) | )$$

where $$x_i$$ is the signal from well $\latex i$ and $$x$$ represents all the sample wells. The key feature of this approach is that it is more robust to outliers and violations of the assumption that the signal data is normally distributed. Chung et al present a nice simulation study, highlighting the effectiveness of MAD based selection over mean based.

However, Zhang et al also note that the MAD method is not robust to skewed distributions. To address this they suggest a quartile based threshold method. For the sample data, one calculates the first, second and third quartiles ($$Q_1, Q_2, Q_3$$) and generates the lower and upper thresholds for hit selection as the smallest observed value just above $$Q_1 – c IQR$$ and the largest observed value just below $$Q_3 – c IQR$$ respectively. In both cases, $$IQR = Q_3 – Q_1$$. A key feature of this method is the choice of $$c$$, and is based on the desired targeted error rate (based on a normal distribution). Zhang et al show that one can choose $$c$$ such that the resultant error rate is equivalent to that obtained from a mean or median-based thresholding scheme, for a given $$k$$. Thus for $$k = 3$$ in those methods, the quartile method would set $$c = 1.7239$$.

In contrast to the above methods, which define a threshold and automatically select siRNA’s above the threshold as hits, one can order the wells using a metric and then simply select the top $$n$$ wells for followup, where $$n$$ is dictated by the resources available for followup. In this vein, Wiles et al discuss a variety of ranking metrics methods for RNAi screens. Some of the methods they discuss include simple background substraction quantile normalization and so on IQR normalization. Surprisingly, in their analysis, ranking of siRNA’s based on background subtraction outperformed IQR normalization.

Zhang et al (the Merck people have been busy!) described another ranking procedure based on the Strictly Standardized Mean Difference (SSMD) between signals from individual siRNA’s and a negative control. While conceptually similar to simple schemes such as percentage inhibition, this method is reported to be more robust to variability in the sample and control groups and also has a probabilistic interpretation. The latter aspect allowed the authors to develop a thresholding scheme, based on controlling the false positive and false negative rates.

Another approach described by Konig et al makes use of the fact that in RNAi screens, a given gene is targeted by 2 to 4 different siRNA’s. In contrast, none of the above methods, consider this aspect of an RNAi screen. Their method, termed RSA (redundant siRNA activity) analysis, initially ranks all wells in an assay based on their signals. Next, they consider the rank distribution of all siRNA’s targeting a given gene and determine a p-value, which indicates whether all the siRNA’s targeting a given gene have low ranks. They then rerank all wells based on their p-values , followed by their individual signals. The net result of this is that, even though a gene might have 4 wells (i.e., siRNA’s) exhibiting moderate activity (and so might be just below the threshold in a mean or MAD based analysis), it will be “weighted” more heavily than a gene with, say, just one highly active siRNA. Their experiments and analysis indicate that this is a useful method and their implementations (C#, Perl, R) are available. A nice side effect of this method is that they are able suggest an optimal (in terms of efficacy) number of redundant siRNA’s per gene.

### Platewise or experimentwise?

One aspect that I haven’t mentioned is whether these analyses should be performed in a platewise manner or across all the plates in an experiment (i.e. experimentwise manner). Given that an RNAi screen can include a large number of plates, it’s possible that plate specific effects might skew an anaysis that considers all plates together. Thus platewise analysis can handle different systematic errors within plates. At the same time, a platwise analysis would be ignorant of any clustering that may be present amongst the hits, which would be identifiable is one performed the analysis in an experimentwise manner. Zhang et al suggest the following scheme, based on their observations that platewise analysis tends to have a higher probability of identifying true hits.

• Select a thresholding method
• Apply the method in an experimentwise manner on the normalized data, to see if hit clusters are present
• If such clusters are present, repeat the above step on un-normalized data to check whether clustering is an artifact
• If no clusters are observed, perform the analysis in a platewise manner and select hits from each plate

While a useful protocol, Zhang et al describe a Bayesian approach, derived from a similar method for microarray data (Newton et al). One of the key features of this approach is that it incorporates aspects of a platewise and experimentwise analysis in a single framework, obviating the need to choose one or the other.

### Knowledge based methods

The statistics based methods work on the basis of relatively large samples and employ a variety of distributional assumptions. They’re certainly useful as a first pass filter, but various questions remain: Should we totally ignore wells that lie just below a threshold? Is there a way to check for false negatives? False positives?

To address these questions, before running a follow up screen, one can make use of external knowledge. More specifically, genes do not exist in isolation. Rather, they can be considered in terms of networks. Indeed, studies of PPI‘s (Chuang et al) and disease genes (Kohler et al) have suggested a guilt by association principle – genes that are close together in a network are more likely to exhibit similar functions or lead to similar phenotypes.

Wang et al employed this principle to characterize the genes targeted by siRNA’s in a screen, by constructing an interaction network (based on STRING data) and analyzing the connectivity of the genes from the screen in this context. Their observation was that genes corresponding to hits from the RNAi screen exhibited a higher degree of connectivity, compared to random networks. While an interesting approach, some of it does seem a little ad hoc. Given the fact that it depends on PPI data, the networks constructed from screening results may be incomplete, if the PPI database does not contain interactions for the genes in question. In addition, while their network based scoring method seems to correlate somewhat with followup results, it’s not very consistent. However, the approach makes sense and correctly catches false negatives in their test cases. Their R code for the network based scoring system is available.

Interestingly, I haven’t found many other studies of such knowledge based hit selection methods. While networks are one way to identify related genes, other methods could be considered and it would be interesting to see how they compare.

Written by Rajarshi Guha

June 21st, 2009 at 8:49 pm

Posted in research

Tagged with ,