# So much to do, so little time

Trying to squeeze sense out of chemical data

## High Content Screens and Multivariate Z’

While contributing to a book chapter on high content screening I came across the problem of characterizing screen quality. In a traditional assay development scenario the Z factor (or Z’) is used as one of the measures of assay performance (using the positive and negative control samples). The definition of Z’ is based on a 1-D readout, which is the case with most non-high content screens. But what happens when we have to deal with 10 or 20 readouts, which can commonly occur in a high content screen?
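For reference, the usual univariate Z’ is one minus three times the summed standard deviations of the two control groups, divided by the absolute difference of their means. A minimal sketch follows; the function name `z_prime` and the well values are made up purely for illustration.

```python
from statistics import mean, stdev

def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg| for a 1-D readout."""
    separation = abs(mean(pos) - mean(neg))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / separation

# Hypothetical control readouts (not from any real screen)
pos_wells = [98.0, 102.0, 100.0, 101.0, 99.0]
neg_wells = [10.0, 12.0, 9.0, 11.0, 8.0]

print(round(z_prime(pos_wells, neg_wells), 3))  # prints 0.895
```

The value approaches 1 only when the control spreads are tiny relative to the gap between the control means, and it goes negative as soon as the two spreads overlap.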

Assuming one has identified a small set of biologically relevant phenotypic parameters (from the tens or hundreds spit out by HCA software), it makes sense that one measure the assay performance in terms of the overall biology, rather than one specific aspect of the biology. In other words, a useful performance measure should be able to take into account multiple (preferably orthogonal) readouts. In fact, in many high content screening assays, the use of the traditional Z’ with a single readout leads to very low values suggesting a poor quality assay, when in fact, that is not the case if one were to consider the overall biology.

One approach that has been described in the literature is an extension of the Z’, termed the multivariate Z’. The approach was first described by Kummel et al., who develop an LDA model trained on the positive and negative control wells. Each well is described by N phenotypic parameters and the assumption is that one has pre-selected these parameters to be meaningful and relevant. The key to using the model for a Z’ calculation is to replace the N-dimensional values for a given well by the 1-dimensional linear projection of that well:

$P_i = \sum_{j=1}^{N} w_j x_{ij}$

where $P_i$ is the 1-D projected value, $w_j$ is the weight for the $j$-th phenotypic parameter and $x_{ij}$ is the value of the $j$-th parameter for the $i$-th well.

The projected value is then used in the Z’ calculation as usual. Kummel et al showed that this approach leads to better (i.e., higher) Z’ values compared to the use of univariate Z’. Subsequently, Kozak & Csucs extended this approach and used a kernel method to project the N-dimensional well values in a non-linear manner. Unsurprisingly, they show a better Z’ than what would be obtained via a linear projection.
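The linear version can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the textbook two-class Fisher discriminant (weights = inverse pooled covariance times the difference of control means) as the LDA step, which is not necessarily the exact implementation of Kummel et al., and all function names and the synthetic wells are hypothetical.

```python
import random
from statistics import mean, stdev

def z_prime(pos, neg):
    """Univariate Z' from 1-D control readouts."""
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))

def column_means(rows):
    return [mean(col) for col in zip(*rows)]

def pooled_covariance(pos, neg):
    """Pooled within-class covariance matrix of the two control groups."""
    n_params = len(pos[0])
    scatter = [[0.0] * n_params for _ in range(n_params)]
    for rows in (pos, neg):
        mu = column_means(rows)
        for row in rows:
            d = [x - m for x, m in zip(row, mu)]
            for a in range(n_params):
                for b in range(n_params):
                    scatter[a][b] += d[a] * d[b]
    dof = len(pos) + len(neg) - 2
    return [[s / dof for s in row] for row in scatter]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting
    (fails with ZeroDivisionError if the matrix is singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def lda_weights(pos, neg):
    """Fisher discriminant direction: w = S_pooled^-1 (mean_pos - mean_neg)."""
    diff = [p - q for p, q in zip(column_means(pos), column_means(neg))]
    return solve(pooled_covariance(pos, neg), diff)

def project(rows, w):
    """P_i = sum_j w_j * x_ij for every well i."""
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in rows]

def multivariate_z_prime(pos, neg):
    w = lda_weights(pos, neg)
    return z_prime(project(pos, w), project(neg, w))

# Synthetic wells: two parameters share a large common variation, and the
# controls differ only by a small shift in the second parameter.
rng = random.Random(1)

def make_wells(shift, n=30):
    wells = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        wells.append([shared, shared + shift + rng.gauss(0.0, 0.1)])
    return wells

pos, neg = make_wells(+0.5), make_wells(-0.5)
uv = z_prime([r[1] for r in pos], [r[1] for r in neg])  # second parameter alone
mv = multivariate_z_prime(pos, neg)                     # LDA projection
print(f"univariate Z': {uv:.2f}, multivariate Z': {mv:.2f}")
```

In this toy setup the projection cancels the shared variation, so the multivariate value comes out well above the univariate one. Note that the weights are fitted and evaluated on the same wells.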

And this is where I have my beef with these methods. In fact, a number of beefs:

• These methods are model based and so can suffer from over-fitting. No checks for this were made, and if over-fitting were to occur one would obtain a falsely optimistic Z’.
• These methods assert success when they perform better than a univariate Z’ or when a non-linear projection does better than a linear projection. But neither comparison is a true indication that they have captured the assay performance in an absolute sense. In other words, what is the “ground truth” that one should be aiming for, when developing multivariate Z’ methods? Given that the upper bound of Z’ is 1.0, one can imagine developing methods that give you increasing Z’ values – but does a method that gives Z’ close to 1 really mean a better assay?  It seems that published efforts are measured relative to other implementations and not necessarily to an actual assay quality (however that is characterized).
• While the fundamental idea of separation of positive and negative control responses as a measure of assay performance is good, methods that are based on learning this separation are at risk of generating overly optimistic assessments of performance.
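The over-fitting concern can be illustrated with a small simulation. The sketch below makes several assumptions that are not from the papers discussed: the "controls" are pure noise with no real difference, and to keep the code short the learned projection is a diagonal discriminant (per-parameter mean difference over pooled variance, covariances ignored) swapped in for full LDA. All names are hypothetical.

```python
import random
from statistics import mean, stdev, variance

def z_prime(pos, neg):
    """Univariate Z' from 1-D control readouts."""
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))

def fit_diagonal_weights(pos, neg):
    """Weight per parameter: mean difference / pooled variance (no covariances)."""
    weights = []
    for p_col, n_col in zip(zip(*pos), zip(*neg)):
        dof = len(p_col) + len(n_col) - 2
        pooled = ((len(p_col) - 1) * variance(p_col)
                  + (len(n_col) - 1) * variance(n_col)) / dof
        weights.append((mean(p_col) - mean(n_col)) / pooled)
    return weights

def project(rows, weights):
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

def noise_wells(rng, n_wells, n_params):
    """Wells drawn from the same distribution regardless of control label."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_params)] for _ in range(n_wells)]

rng = random.Random(0)
n_params = 15
train_pos, train_neg = noise_wells(rng, 8, n_params), noise_wells(rng, 8, n_params)
test_pos, test_neg = noise_wells(rng, 200, n_params), noise_wells(rng, 200, n_params)

w = fit_diagonal_weights(train_pos, train_neg)
in_sample = z_prime(project(train_pos, w), project(train_neg, w))
held_out = z_prime(project(test_pos, w), project(test_neg, w))
print(f"in-sample Z': {in_sample:.2f}, held-out Z': {held_out:.2f}")
```

Because the weights are tuned to chance differences in the training wells, the in-sample value is expected to sit far above the held-out one even though there is nothing to separate. The gap should widen as the number of parameters grows relative to the number of control wells, which is one argument for reporting a held-out or cross-validated Z’ alongside any model-based one.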

## A counter-example

As an example, I looked at a recent high content siRNA screen we ran that had 104 parameters associated with it. The first figure shows the Z’ calculated using each layer individually (excluding layers with abnormally low Z’).

As you can see, the highest Z’ is about 0.2. After removing those with no variation and members of correlated pairs I ended up with a set of 15 phenotypic parameters. If we compare the per-parameter distributions of the positive and negative control responses, we see very poor separation in all layers but one, as shown in the density plots below (the scales are all independent)

I then used these 15 parameters to build an LDA model and obtain a multivariate Z’ as described by Kummel et al. The multivariate Z’ turns out to be 0.68, suggesting a well-performing assay. I also performed MDS on the 15-parameter set to get lower dimensional (3D, 4D, 5D, 6D etc.) datasets and performed the same calculation, leading to similar Z’ values (0.41 to 0.58).

But in fact, from the biological point of view, the assay performance was quite poor due to poor performance of the positive control (we haven’t found a good one yet). In practice then, the model based multivariate Z’ (at least as described by Kummel et al.) can be misleading. One could argue that I had not chosen an appropriate set of phenotypic parameters, but I checked out a variety of other subsets (though not exhaustively) and got similar Z’ values.

## Alternatives

Of course, it’s easy to complain and while I haven’t worked out a rigorous alternative, the idea of describing the distance between multivariate distributions as a measure of assay performance (as opposed to learning the separation) allows us to attack the problem in a variety of ways. There is a nice discussion on StackExchange regarding this exact question. Some possibilities include

It might be useful to perform a more comprehensive investigation of these methods as a way to measure assay performance.
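As one illustration of measuring a distance between the control distributions directly, rather than learning a projection, here is a sketch of the energy distance between two sets of wells. This is just one possible choice and not necessarily one of the options the author had in mind; the function names and example wells are hypothetical.

```python
import math
from itertools import combinations, product

def euclid(a, b):
    """Euclidean distance between two wells in parameter space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def energy_distance(pos, neg):
    """Squared energy distance, 2 E|X-Y| - E|X-X'| - E|Y-Y'|, estimated
    by averaging over all pairs of wells (V-statistic form)."""
    between = sum(euclid(a, b) for a, b in product(pos, neg)) / (len(pos) * len(neg))
    within_pos = 2 * sum(euclid(a, b) for a, b in combinations(pos, 2)) / len(pos) ** 2
    within_neg = 2 * sum(euclid(a, b) for a, b in combinations(neg, 2)) / len(neg) ** 2
    return 2 * between - within_pos - within_neg

# Hypothetical two-parameter control wells
pos = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
neg = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]

print(energy_distance(pos, neg))  # clearly separated groups give a large value
print(energy_distance(pos, pos))  # identical groups give (numerically) zero
```

No weights are fitted here, so there is nothing to over-fit, but the result depends on the scale of each parameter. In practice one would probably standardize the parameters first, and some calibration would still be needed to turn the raw distance into a pass or fail call on assay quality.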

Written by Rajarshi Guha

September 9th, 2012 at 8:03 pm

## Software for the “Federation of Independent Scientists”

A few days back, Derek Lowe posted a comment from a reader who suggested that one way to approach the current employment challenges in the pharmaceutical industry would be the formation of a Federation of Independent Scientists. Such a federation would be open to consultants, small companies etc. and would use its size to obtain group rates on various things: journal access, health insurance and so on. Obviously, a lot of details are left out here, and when you get into the nitty-gritty a lot of issues arise that don’t have simple answers. Nevertheless, it is an interesting (and welcome, as evidenced by the comment thread) idea.

One aspect raised by a commenter was access to modeling and docking software by such a group. He mentioned that he’d

… like to see an open source initiative develop a free, open source drug discovery package. Why not, all the underlying force fields and QM models have been published … it would just take a team of dedicated programmers and computational chemists time and passion to create it.

This is the very essence of the Blue Obelisk movement, under whose umbrella there is now a wide variety of computational chemistry and cheminformatics software. There’s certainly no lack of passion in the Open Source chemistry software community. As most of it is based on volunteer effort, time is always an issue. This has a direct effect on the features provided by Open Source chemistry software, which does not always match up to commercial tools. But as the commenter above pointed out, many of the algorithms underlying proprietary software are published. It just needs somebody with the time and expertise to implement them. And the combination of these two (in the absence of funding) is not always easy to find.

Of course, having access to the software is just one step. A scientist requires (possibly significant) hardware resources to run the software. Another comment raised this issue and asked about the possibility of a cloud based install of comp chem software.

With regards the sophisticated modelling tools – do they have to be locally installed?

How do the big pharma companies deploy the software now? I would be very surprised if it wasn’t easily packaged, although I guess the number of people using it is limited.

I’m thinking of some kind of virtual server, or remote desktop style operation. Your individual contractor can connect from wherever, and have full access to a range of tools, then transfer their data back to their own location for safekeeping.

CloudBioLinux provides a collection of bioinformatics and structural biology software as a prepackaged AMI for Amazon’s EC2 platform, but I’m not aware of a similarly prepackaged set of Open Source tools for chemistry, and certainly not one based on the cloud. (There are some companies that host comp chem software on the cloud and provide access to these installations for a fee.) While some Linux distributions do package a number of scientific packages (UbuntuScience for example), I don’t think that these would support a computational drug discovery operation. (The above comment doesn’t necessarily focus just on Open Source software. One could consider commercial software hosted on remote servers, though I wonder what type of licensing would be involved.)

The last component would be the issue of data, primarily for cloud based solutions. While compute cycles on such platforms are usually cheap, bandwidth can be expensive. Granted, chemical data is not as big as biological data (cf. 1000Genomes on AWS), but sending a large collection of conformers over the network may not be very cost-effective. One way to bypass this would be to generate “standard” conformer collections and other such libraries and host them on the cloud. But what counts as “standard” and who would pay the hosting costs are open questions.

But I do think there is a sufficiently rich ecosystem of Open Source software that could serve much of the computational needs of a “Federation of Independent Scientists”. It’d be interesting to put together a list of Open Source software based on requirements from the commenters in that thread.

Written by Rajarshi Guha

April 14th, 2012 at 9:23 pm