# So much to do, so little time

Trying to squeeze sense out of chemical data

## Visual pairwise comparison of distributions

While analysing some data from a dose response screen run across multiple cell lines, I needed to visualize summary curve data in a pairwise fashion. Specifically, I wanted to compare area under the curve (AUC) values for the curve fits for the same compound between every pair of cell lines. Given that an AUC needs a proper curve fit, the number of non-NA AUCs differs for each cell line. As a result, making a scatter plot matrix (via plotmatrix) won’t do.

A more useful approach is to generate a matrix of density plots, such that each plot contains the distributions of AUCs from a pair of cell lines overlaid on each other. It turns out that some data.frame wrangling and facet_grid make this extremely easy.

```r
library(ggplot2)
library(reshape)
tmp1 <- data.frame(do.call(cbind, lapply(1:5, function(x) {
  r <- rnorm(100, mean=sample(1:4, 1))
  r[sample(1:100, 20)] <- NA
  return(r)
})))
```

Next, we need to expand this into a form that lets us facet by pairs of variables.

```r
tmp2 <- do.call(rbind, lapply(1:5, function(i) {
  do.call(rbind, lapply(1:5, function(j) {
    r <- rbind(data.frame(var='D1', val=tmp1[,i]),
               data.frame(var='D2', val=tmp1[,j]))
    r <- data.frame(xx=names(tmp1)[i], yy=names(tmp1)[j], r)
    return(r)
  }))
}))
```

Finally, we can make the plot:

```r
ggplot(tmp2, aes(x=val, fill=var)) +
  geom_density(alpha=0.2, position="identity") +
  theme(legend.position = "none") +
  facet_grid(xx ~ yy, scales='fixed')
```

Giving us the plot below.

I had initially asked this on StackOverflow, where Arun provided a more elegant approach to composing the data.frame.

Written by Rajarshi Guha

February 10th, 2013 at 3:03 pm

## Python, permission and forgiveness


This morning I was writing some Python code that needed to perform lookups on a very large map.

```python
mapSize = 65000
amap = {}
for i in range(0, mapSize):
    amap['k%d' % (i)] = i
```

If a key did not exist in the map I needed to take some action. My initial effort performed a pre-emptive check to see whether the key existed in the map:

```python
query = ['k%d' % (x) for x in xrange(100)]
query.extend(['k%d' % (x) for x in xrange(mapSize+1, mapSize+100)])

def permission():
    for q in query:
        if q in amap.keys():
            val = amap[q]
        else:
            pass
```

Looking up roughly 200 keys (half present in and half absent from the map) took 496 ms (based on Python’s timeit module with default settings). On the other hand, if we just go ahead and access the key, dealing with its absence by handling the exception, we get a significant improvement.

```python
def forgiveness():
    for q in query:
        try:
            val = amap[q]
        except KeyError:
            pass
```

Specifically, 150 µs – an improvement of 3306x.
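Worth noting (my addition, not part of the original timings): much of the gap is likely due to `amap.keys()` building a full list of keys on every membership test in Python 2, making each lookup an O(n) scan. Testing membership directly on the dict, or using `dict.get`, avoids both the list construction and the exception machinery. A minimal sketch, with sizes mirroring the example above:

```python
# Build the same map and query list as in the post (Python 3 syntax).
amap = {'k%d' % i: i for i in range(65000)}
query = ['k%d' % x for x in range(100)] + \
        ['k%d' % x for x in range(65001, 65100)]

def membership():
    # `q in amap` hashes the key directly; no list of keys is built
    hits = 0
    for q in query:
        if q in amap:
            hits += 1
    return hits

def with_get():
    # dict.get returns None (or a supplied default) for missing keys;
    # `is not None` matters here since stored values include 0
    return sum(1 for q in query if amap.get(q) is not None)

print(membership(), with_get())
```

Both variants are a third option alongside permission and forgiveness, and in my experience run in the same ballpark as the try/except version.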

So, as with many other things in life, it’s better to ask Python for forgiveness than permission.

Written by Rajarshi Guha

February 7th, 2013 at 4:19 pm

Posted in software


## New version of fingerprint (3.4.9) – faster Dice similarity matrices

I’ve just pushed a new version of the fingerprint package that contains an update provided by Abhik Seal, significantly speeding up calculation of pairwise similarity matrices when using the Dice similarity method. I ran a simple comparison using different numbers of random fingerprints (1024 bits, with 512 bits randomly set to one) and measured the time to evaluate the pairwise similarity matrix. As you can see from the figure alongside, the new code is significantly faster (with speedups of 450x to 500x). The code to generate the timings is below – it should probably be wrapped in a loop to run multiple times for each set size.

```r
fpls <- lapply(seq(10, 300, by=10),
               function(i) sapply(1:i,
                                  function(x) random.fingerprint(1024, 512)))
times <- sapply(fpls,
                function(fpl) system.time(fp.sim.matrix(fpl, method='dice'))[3])
```
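For context, the Dice coefficient between two binary fingerprints is $2|A \cap B| / (|A| + |B|)$, where $A$ and $B$ are the sets of “on” bits. A minimal illustrative sketch in Python (not the package’s actual implementation):

```python
def dice(a, b):
    """Dice similarity between two fingerprints, each given as a set
    of 'on' bit positions."""
    if not a and not b:
        return 0.0  # convention for two empty fingerprints
    return 2.0 * len(a & b) / (len(a) + len(b))

fp1 = {1, 2, 3, 4}
fp2 = {3, 4, 5, 6}
print(dice(fp1, fp2))  # intersection {3, 4}: 2*2/(4+4) = 0.5
```

The pairwise matrix is just this value evaluated over every pair of fingerprints, which is why the constant-factor speedup in the package matters so much at scale.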

Written by Rajarshi Guha

October 30th, 2012 at 11:10 pm

## CINF Webinar: Practical cheminformatics workflows with mobile apps

I’m pleased to announce that the ACS Division of Chemical Information will be hosting a series of free webinars on topics related to chemical information. The webinars will be open to everybody and our first speaker in this series will be Dr. Alex Clark, who’ll be talking about cheminformatics workflows and mobile applications. More details below or at http://www.acscinf.org/about/news/20120918.php

Webinar: Practical cheminformatics workflows with mobile apps
Date: October 3, 2012
Time: 11 am Eastern time (US)

### Abstract

In recent years smartphones and tablets have attained sufficient power and sophistication to replace conventional desktop and laptop computers for many tasks. Chemistry software is late to the party, but rapidly catching up. This webinar will explore some of the cheminformatics functionality that can currently be performed using mobile apps. A number of workflow scenarios will be discussed, such as: creating and maintaining chemical data (molecules, reactions, numbers & text); searching chemical databases and utilising the results; structure-aware lab notebooks; visualisation and structure-activity analysis; property calculation using remote webservices; and a multitude of ways to share data collaboratively, and integrate modular apps within distributed and heterogeneous workflows.

### Speaker

Alex M. Clark graduated from the University of Auckland, New Zealand, with a Ph.D. in synthetic organometallic chemistry, then went on to work in computational chemistry. His chemistry background spans both the lab bench and development of software for a broad variety of 2D and 3D computer aided molecular design algorithms and user interfaces. He is the founder of Molecular Materials Informatics, Inc., which is dedicated to producing next-generation cheminformatics software for emerging platforms such as mobile devices and cloud computing environments.

Written by Rajarshi Guha

September 27th, 2012 at 10:00 pm

Posted in Uncategorized


## High Content Screens and Multivariate Z’

While contributing to a book chapter on high content screening I came across the problem of characterizing screen quality. In a traditional assay development scenario the Z factor (or Z’) is used as one of the measures of assay performance (using the positive and negative control samples). The definition of Z’ is based on a 1-D readout, which is the case with most non-high content screens. But what happens when we have to deal with 10 or 20 readouts, which can commonly occur in a high content screen?
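For reference, the univariate Z’ referred to here is the usual Zhang et al. definition, computed from the positive and negative control readouts (added for context; the subscript notation is mine):

$Z' = 1 - \frac{3(\sigma_{p} + \sigma_{n})}{|\mu_{p} - \mu_{n}|}$

where $\mu_p, \sigma_p$ and $\mu_n, \sigma_n$ are the mean and standard deviation of the positive and negative control readouts, respectively.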

Assuming one has identified a small set of biologically relevant phenotypic parameters (from the tens or hundreds spit out by HCA software), it makes sense to measure assay performance in terms of the overall biology, rather than one specific aspect of it. In other words, a useful performance measure should be able to take into account multiple (preferably orthogonal) readouts. In fact, in many high content screening assays, the use of the traditional Z’ with a single readout leads to very low values, suggesting a poor quality assay when that is not in fact the case if one considers the overall biology.

One approach that has been described in the literature is an extension of the Z’, termed the multivariate Z’. The approach was first described by Kummel et al, who develop an LDA (linear discriminant analysis) model trained on the positive and negative control wells. Each well is described by N phenotypic parameters, and the assumption is that one has pre-selected these parameters to be meaningful and relevant. The key to using the model for a Z’ calculation is to replace the N-dimensional values for a given well by the 1-dimensional linear projection of that well:

$P_i = \sum_{j=1}^{N} w_j x_{ij}$

where $P_i$ is the 1-D projected value, $w_j$ is the weight for the $j$’th phenotypic parameter and $x_{ij}$ is the value of the $j$’th parameter for the $i$’th well.

The projected value is then used in the Z’ calculation as usual. Kummel et al showed that this approach leads to better (i.e., higher) Z’ values compared to the use of univariate Z’. Subsequently, Kozak & Csucs extended this approach and used a kernel method to project the N-dimensional well values in a non-linear manner. Unsurprisingly, they show a better Z’ than what would be obtained via a linear projection.
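To make the procedure concrete, here is a minimal sketch of the LDA-projection idea on synthetic control wells. This is my reconstruction, assuming scikit-learn’s `LinearDiscriminantAnalysis`; the well counts and parameter distributions are invented for illustration and are not from Kummel et al:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_params = 15

# Synthetic control wells: 96 positives and 96 negatives, with each
# phenotypic parameter shifted between the two classes.
pos = rng.normal(loc=1.0, scale=0.5, size=(96, n_params))
neg = rng.normal(loc=0.0, scale=0.5, size=(96, n_params))
X = np.vstack([pos, neg])
y = np.array([1] * len(pos) + [0] * len(neg))

# Fit LDA on the controls and project each well onto the 1-D axis.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
p = lda.transform(X).ravel()
p_pos, p_neg = p[y == 1], p[y == 0]

# Apply the standard Z' formula to the projected values.
zprime = 1 - 3 * (p_pos.std() + p_neg.std()) / abs(p_pos.mean() - p_neg.mean())
print(round(zprime, 2))
```

Note that the model is fit and evaluated on the same wells, which is exactly the over-fitting risk discussed below.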

And this is where I have my beef with these methods. In fact, a number of beefs:

- These methods are model based and so can suffer from over-fitting. No checks were made, and if over-fitting were to occur one would obtain a falsely optimistic Z’.
- These methods assert success when they perform better than a univariate Z’, or when a non-linear projection does better than a linear projection. But neither comparison is a true indication that they have captured the assay performance in an absolute sense. In other words, what is the “ground truth” that one should be aiming for when developing multivariate Z’ methods? Given that the upper bound of Z’ is 1.0, one can imagine developing methods that give you increasing Z’ values – but does a method that gives Z’ close to 1 really mean a better assay? It seems that published efforts are measured relative to other implementations and not necessarily to an actual assay quality (however that is characterized).
- While the fundamental idea of separation of positive and negative control responses as a measure of assay performance is good, methods that are based on learning this separation are at risk of generating overly optimistic assessments of performance.

### A counter-example

As an example, I looked at a recent high content siRNA screen we ran that had 104 parameters associated with it. The first figure shows the Z’ calculated using each layer individually (excluding layers with abnormally low Z’).

As you can see, the highest Z’ is about 0.2. After removing parameters with no variation and one member of each correlated pair, I ended up with a set of 15 phenotypic parameters. If we compare the per-parameter distributions of the positive and negative control responses, we see very poor separation in all layers but one, as shown in the density plots below (the scales are all independent).
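The filtering step above (dropping no-variation parameters and one member of each highly correlated pair) can be sketched as follows. This is a hypothetical pandas version of the idea, and the 0.9 correlation cutoff is my assumption, not a value from the original analysis:

```python
import numpy as np
import pandas as pd

def filter_parameters(df, cor_cutoff=0.9):
    """Drop zero-variance columns, then one member of each highly
    correlated column pair."""
    df = df.loc[:, df.std() > 0]  # remove no-variation parameters
    cor = df.corr().abs()
    # keep only the upper triangle so each pair is examined once
    upper = cor.where(np.triu(np.ones(cor.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > cor_cutoff).any()]
    return df.drop(columns=drop)

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({'p1': a,
                   'p2': a + rng.normal(scale=0.01, size=100),  # ~duplicate of p1
                   'p3': rng.normal(size=100),
                   'p4': np.zeros(100)})                        # no variation
print(list(filter_parameters(df).columns))  # p4 and p2 are dropped
```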

I then used these 15 parameters to build an LDA model and obtain a multivariate Z’ as described by Kummel et al. Now, the multivariate Z’ turns out to be 0.68, suggesting a well performing assay. I also performed MDS on the 15-parameter set to get lower dimensional (3D, 4D, 5D, 6D, etc.) datasets and performed the same calculation, leading to similar Z’ values (0.41 – 0.58).

But in fact, from the biological point of view, the assay performance was quite poor due to the poor performance of the positive control (we haven’t found a good one yet). In practice then, the model-based multivariate Z’ (at least as described by Kummel et al) can be misleading. One could argue that I had not chosen an appropriate set of phenotypic parameters – but I checked a variety of other subsets (though not exhaustively) and got similar Z’ values.

### Alternatives

Of course, it’s easy to complain, and while I haven’t worked out a rigorous alternative, the idea of describing the distance between multivariate distributions (as opposed to learning the separation) as a measure of assay performance lets us attack the problem in a variety of ways. There is a nice discussion on StackExchange regarding this exact question. Some possibilities include

It might be useful to perform a more comprehensive investigation of these methods as a way to measure assay performance.

Written by Rajarshi Guha

September 9th, 2012 at 8:03 pm