Archive for the ‘HTS’ tag
I came across a paper from Keith Shockley that describes the use of weighted entropy to rank order dose response curves. As this data type is the bread and butter of my day job, a simple ranking method is always of interest to me. However, closer inspection of the paper reveals some fundamental problems.
The paper correctly notes that there is no definitive protocol to rank compounds using their dose response curves. Such rankings are invariably problem dependent – in some cases, simple potency based ranking of good quality curves is sufficient. In other cases structural clustering combined with a measure of potency enrichment is more suitable. In addition, it is also true that all compounds in a screen do not necessarily fit well to a 4-parameter Hill model. This may simply be due to noise but could also be due to some process that is better fit by some other model (bell or U shaped curves). The point being that rankings based on a pre-defined model may not be useful or accurate.
The paper proposes the use of entropy as a way to rank dose response curves in a model-free manner. While a natural approach is to use Shannon entropy, the author suggests that the equal weighting implicit in the calculation is unsuitable. Instead, the use of weighted entropy (WES) is proposed as a more robust approach that takes unreliable data points into account. The author defines the weights based on the level of detection of the assay (though I’d argue that since the intended goal is to capture the reliability of individual response points, a more appropriate weight should be derived from some form of variance – either from replicate data or else pooled across the collection). The author then suggests that curves should be ranked by the WES value, with higher values indicating a better rank. However, I believe that the use of entropy is not suitable as a ranking procedure and in fact, as my experiments below show, it doesn’t appear to work.
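For concreteness, a generic form of weighted Shannon entropy is H_w = -Σ wᵢ pᵢ log pᵢ, where the responses are normalized to a probability-like vector p and the weights w capture point reliability. To be clear, this is my sketch of the general idea, not necessarily the paper’s exact WES definition, and the weights below are placeholders:

```python
import math

def weighted_entropy(responses, weights):
    """Weighted Shannon entropy of a response vector.

    Responses are shifted and normalized to a probability-like
    distribution; weights down-weight unreliable points. This is a
    generic form, not necessarily the paper's exact WES definition.
    """
    # Shift so all values are non-negative, then normalize to sum to 1
    lo = min(responses)
    shifted = [r - lo for r in responses]
    total = sum(shifted)
    if total == 0:
        return 0.0
    p = [s / total for s in shifted]
    return -sum(w * pi * math.log(pi) for w, pi in zip(weights, p) if pi > 0)

# A flat (inactive) trace vs a well-defined sigmoid-like response
flat = [50, 51, 49, 50, 50, 51]
curve = [0, 2, 10, 45, 90, 100]
w = [1.0] * 6  # equal weights reduce this to plain Shannon entropy
print(weighted_entropy(flat, w), weighted_entropy(curve, w))
```

With equal weights this is just H; the point of the weighted form is that noisy points near the detection limit contribute less.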
For any proposed ranking scheme, one must first define what the goal is. When ranking dose response curves are we looking for compounds
- that exhibit a well defined dose response (top and bottom asymptotes, > 80% efficacy, etc.)?
- that show good potency, even if the curve is not that well fit?
- that belong to a specific chemotype?
One of the key omissions of the paper is that it does not explain what the end goal of the ranking is. Entropy of curve data is not equivalent to potency, goodness of fit or other curve characteristics. I suppose one could say that entropy ranking lets one differentiate noise from actual curves (whatever functional form is required to fit them). However, this is not necessarily the case, as shown in the overlaid density plots alongside. The pink region represents WES values calculated from a set of 466 curves whereas the green region represents WES from normally distributed random data (μ = 50, σ = 10). For this case, the WES values from real data (i.e., measured curves) are completely overlapped by those derived from random data. Thus, in this case, the WES would not differentiate between the two sets of data.
Even ignoring random data, the use of entropy does not reliably differentiate between well defined curves, inactive curves and toxic curves. For example, in the figure alongside, the inactive compound exhibits a higher WES than the well defined active curve. The paper does explicitly note that the method was tested on activating curves only, but that should not preclude applying it to inhibitory curves, as in this example.
But more fundamentally, if one assumes that the goal of a ranking scheme for dose response curves is to place good quality actives at the top, then the proposed WES (or even Shannon entropy, H) does not do the job. One way to test this is to take a collection of curves, rank them by a measure and identify how many actives are found in the top N% of the collection, for varying N. Ideally, a good ranking would identify nearly all the actives for a small N; if the ranking were random, one would identify N% of the actives in the top N% of the collection. Here an active is defined in terms of curve class, a heuristic that we use to initially weed out poor quality curves and focus on good quality ones. I defined active as curve classes -1.1, -1.2, -2.2 and -2.1. On the four different data sets I looked at, the WES and H do significantly worse than random, as shown in the four enrichment curves below (the dashed diagonal corresponds to random ranking).
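The enrichment test described above is easy to sketch. The scores and activity labels below are made up for illustration; in the real analysis the scores would be WES, H, AUC or LAC50 values and the actives would come from curve classes:

```python
def enrichment(scores, is_active, frac):
    """Fraction of all actives found in the top `frac` of the
    collection when ranked by `scores` (higher score = better rank)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_top = max(1, int(round(frac * len(scores))))
    n_actives = sum(is_active)
    found = sum(is_active[i] for i in order[:n_top])
    return found / n_actives

# Toy data: 10 compounds, 3 actives; a perfect score ranks them first
scores    = [9, 8, 7, 1, 2, 3, 0, 4, 5, 6]
is_active = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(enrichment(scores, is_active, 0.3))  # 1.0 -> all actives in the top 30%
```

Sweeping `frac` from 0 to 1 and plotting the result against the diagonal gives the enrichment curves shown below; a ranking that tracks the diagonal is no better than random.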
Instead, the ranking scheme that seems to perform consistently better is the AUC (area under the dose response curve). I certainly don’t claim that AUC is a completely robust way to rank dose response curves (in fact, for some cases such as invalid curve fits, it is nonsensical). But one would hope that WES does better than random! I also include LAC50, the logarithm of the AC50, as a ranking method simply because the paper considers it a poor way to rank curves (which I agree with, particularly if one does not first filter for good quality, efficacious curves).
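The AUC ranking is just numerical integration of the response over log concentration. A minimal sketch, with hypothetical curves:

```python
def auc(log_conc, responses):
    """Area under a dose response curve via the trapezoidal rule,
    using log concentration as the x-axis."""
    area = 0.0
    for i in range(1, len(log_conc)):
        dx = log_conc[i] - log_conc[i - 1]
        area += 0.5 * (responses[i] + responses[i - 1]) * dx
    return area

log_conc = [-9, -8, -7, -6, -5, -4]   # log10 molar concentrations
active   = [0, 5, 20, 60, 95, 100]    # well-defined response curve
inactive = [0, 1, 0, 2, 1, 0]         # flat, inactive curve
print(auc(log_conc, active) > auc(log_conc, inactive))  # True
```

A flat or noisy curve integrates to a small area, so well defined actives naturally float to the top, which is consistent with the enrichment behavior described above.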
Theoretically, I see no reason that entropy should correlate with curve quality (as identified by curve class), so I wouldn’t be surprised by a low quality ranking. However, as defined by the paper, the WES is significantly and consistently poorer than random, which is quite surprising.
There are other issues – Table 3 does not seem to be correct. Surely β-testosterone is not an AR agonist with an AC50 of 9.57 × 10⁻²² μM. In addition, I’m not convinced that a single dataset represents a sufficient validation (given that Tox21 has about 80 published bioassays in PubChem). But in my opinion, this is more a sign of sloppy reviewing & editing than anything else.
UPDATE (2/25) – Regenerated the enrichment curves so that data was ranked in the correct order when LAC50 was being used.
So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing The R Book and came across survival models. Such models are used to characterize time to events, where an event could be death of a patient or failure of a part and so on. In these types of models the dependent variable is the number of time units that pass till the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.
It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons for why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How does molecular weight, lipophilicity etc. affect a compound’s “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays can be more or less indicative of a compound’s survival in a campaign. Survival models allow us to address these questions.
How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.
Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve characterizes the proportion of individuals alive at a certain time point. There are three types of survivorship curves (I, II and III), corresponding to a higher risk of death at later times, a constant risk of death, and a higher risk of death at earlier times, respectively.
For the case of a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general, they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There are a variety of functions to choose from, but they either require that you know the error distribution or that you are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model. I won’t go into the details of either approach, save to note that using a Cox model does not allow us to make predictions beyond the last time point whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last timepoint (i.e., the last assay) but are more interested in knowing what factors might affect survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.
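The survivorship curve itself can be estimated non-parametrically with the Kaplan-Meier estimator (in R this is survfit from the survival package). As a minimal sketch, here is the estimator in Python with entirely made-up compound data, where time is the number of assays a compound was tested in:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survivorship curve.

    times  : time (here, number of assays) at which each compound
             exited the pipeline or was last observed
    events : 1 if the compound was dropped ("died"), 0 if censored
    """
    surv, s = [], 1.0
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n = sum(1 for ti in times if ti >= t)  # still at risk at time t
        s *= 1.0 - d / n
        surv.append((t, s))
    return surv

# 8 hypothetical compounds: most are dropped after the primary assay,
# which is the Type III shape discussed above
times  = [1, 1, 1, 1, 2, 2, 3, 4]
events = [1, 1, 1, 0, 1, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, s)
```

The steep early drop in the estimated curve is exactly the Type III behavior expected of a screening funnel.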
OK – it sounds cute, but it has some obvious limitations:
- The use of a survival model assumes a linear timeline. In many screening campaigns, the individual assays may not follow each other in a linear fashion, so either they must be collapsed into a linear sequence or some assays must be discarded.
- A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules is selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others, or some notion of ‘interesting’ is combined with a hunch that it will work out. Essentially, chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that the model may capture the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
- Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, the decrease in death rate is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2000 will be cherry picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.
The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.
Sometime back Baell et al published an interesting paper describing a set of substructure filters to identify compounds that are promiscuous in high throughput biochemical screens. They termed these compounds Pan Assay Interference Compounds or PAINS. There are a variety of functional groups that are known to be problematic in HTS assays. The reasons for exclusion of molecules with these and other groups range from reactivity towards proteins to poor developmental potential or known toxicity. Derek Lowe has a nice summary of the paper.
The paper published the substructure filters as a collection of Sybyl Line Notation (SLN) patterns. Unfortunately, without access to Sybyl, it’s difficult to reuse the published patterns. Having them in SMARTS form would allow one to use them with many more (open source or commercial) tools. Luckily, Wolf Ihlenfeldt came to the rescue and provided me access to a version of the CACTVS toolkit that was able to convert the SLN patterns to SMARTS.
There are three files, p_l15, p_l150 and p_m150 corresponding to tables S8, S7 and S6 from the supplementary information. The first column is the pattern and the second column is the name for that pattern taken from the original SLN files. While all patterns were converted to SMARTS, the conversion process is not perfect as I have not been able to reproduce (using the OEChem toolkit with the Tripos aromaticity model) all the hits that were obtained using the original SLN patterns.
(As a side note, the SMARTSViewer is a really handy tool to visualize a SMARTS pattern – which is great since many of the PAINS patterns are very complex)
A key feature of high throughput screening (HTS) efforts is automation. The NCGC is no stranger to automation, with two Kalypsys robots and a variety of automated components such as liquid handlers and so on. But while the screen itself is automated, the transitions between subsequent steps are not. Thus, after a screen is complete, I’m notified that the data is located in some directory. I then load up the data, process it and end up with a set of compounds for followup, and then send the list of compounds to be plated and screened in a follow up assay.
In a number of situations, this approach is unavoidable as the data processing stage requires human intervention (plate corrections, switching controls, etc.). But in some situations, we can automate the whole process – primary screen, automated analysis & compound selection and secondary screen. Given that most screens at NCGC are dose response screens, we can refine an automated pipeline by processing individual plate series (i.e. a collection of plates representing a titration series) rather than waiting for all the plates to be completed. Another important point to note is that the different steps being considered here take different times. Thus screening a plate series might take 15 minutes, processing the resultant data and making selections would take 3 minutes and performing the secondary screen might take 10 minutes. Clearly the three steps have to proceed in the given order – but we don’t necessarily want to wait for each preceding step to be complete. In other words, we need the steps to proceed asynchronously, yet maintain temporal ordering.
One approach to automating such a process is the use of a message queue (MQ). The fundamental idea behind a MQ is that one creates a queue on some machine and then starts one or more processes (likely on some other machines) to send messages to the queue. These messages can then be retrieved by one or more listener processes. MQ systems provide a number of useful features beyond the core functionality of storing and distributing messages – these include message persistence, security policy, routing, batching and so on.
In our case, when a plate series is screened, the robot sends a message to the queue. Some process will be listening to the queue and when it sees a message, pulls it off the queue and processes the data from the screen for that plate series. Once processing is complete, the process sends another message to the queue (or another queue) from which yet another process (this one running on another robot) can pull it off and start the secondary screen on the selected compounds. Thus, as soon as a plate series is finished in the primary screen, we can start the processing and follow up, while the next plate series gets started. A message queue approach is also useful since messages can remain on the queue until the appropriate listener pulls them off for processing. A good queue system will ensure that such messages are delivered reliably and don’t get lost.
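The plate series pipeline above can be mimicked with any MQ system. As a toy illustration (using Python’s in-process queue module rather than a real broker like Oracle AQ, and with a placeholder selection rule), two queues decouple the screening, processing and follow-up stages:

```python
import queue
import threading

# Two queues decouple the three stages: the "robot" pushes finished
# plate series; a processor pulls each one, selects compounds and
# pushes follow-up requests; a second robot would consume those.
screened = queue.Queue()
followup = queue.Queue()

def processor():
    while True:
        series = screened.get()
        if series is None:          # sentinel: no more plate series
            followup.put(None)
            break
        # stand-in for real curve fitting / compound selection
        selected = [c for c in series["compounds"] if c % 2 == 0]
        followup.put({"series": series["id"], "selected": selected})

t = threading.Thread(target=processor)
t.start()

# the "robot" finishes plate series asynchronously
for i in range(3):
    screened.put({"id": i, "compounds": list(range(i, i + 5))})
screened.put(None)
t.join()

results = []
while (msg := followup.get()) is not None:
    results.append(msg)
print(results)
```

The key property is that each plate series flows downstream as soon as it is ready, while the queues preserve the temporal ordering of the three stages.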
The diagram below highlights this approach. The solid lines represent the traditional workflow. Given that we’d manually process the screening data, we’d wait till all plate series are run. The dashed lines represent a message based workflow, in which we can process each plate series independently.
In the next few posts I’ll describe such a message queue based workflow that I’ve been working on these past few days. Currently it’s specific to a screen that we’re going to be running. The infrastructure is written in Java and makes use of Oracle Advanced Queue (AQ) to provide message queues and the facilities for receiving and sending messages. I’ll describe a minimal implementation that makes use of Java Messaging Services (JMS) and the standard JMS message types and then follow on with an example using a custom message type that maps to an Oracle user defined type, allowing for more “object oriented” messages.
My previous post did a quick comparison of the GSK anti-malarial screening dataset with a virtual library of Ugi products. That comparison was based on the PubChem fingerprints and indicated a broad degree of overlap. I was also interested in looking at the overlap in other feature spaces. The simplest way to do this is to evaluate a set of descriptors and then perform a principal components analysis. We can then plot the first two principal components to get an idea of the distribution of the compounds in the defined space.
I evaluated a number of descriptors using the CDK. In a physicochemical space represented by the number of rotatable bonds, molecular weight and XlogP values, a plot of the first two principal components looks as shown on the right. Given the large number of points, the plot is more of a blob, but does highlight the fact that there is a good degree of overlap between the two datasets. On going to a BCUT space on the left, we get a different picture, stressing the greater diversity of the GSK dataset. Of course, these are arbitrary descriptor spaces and not necessarily meaningful. One would probably choose a descriptor space based on the problem at hand (and also the CDK XlogP implementation probably needs some work).
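As a sketch of the descriptor-space comparison (with made-up descriptor values; a real analysis would compute CDK descriptors for the full datasets and use a linear algebra library), PCA for two descriptors can be done in closed form from the 2×2 covariance matrix:

```python
import math

def pca_2d(data):
    """Project 2-descriptor data onto its two principal components,
    using the closed-form eigendecomposition of the 2x2 covariance."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    # largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    l1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)
    # first eigenvector (fall back to an axis when sxy == 0)
    v1 = (sxy, l1 - sxx) if sxy else ((1.0, 0.0) if sxx >= syy else (0.0, 1.0))
    norm = math.hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # orthogonal second component
    return [((x - mx) * v1[0] + (y - my) * v1[1],
             (x - mx) * v2[0] + (y - my) * v2[1]) for x, y in data]

# hypothetical descriptor pairs, e.g. (molecular weight / 100, XlogP)
pts = pca_2d([(2.5, 1.0), (3.0, 1.5), (3.5, 2.0), (4.0, 2.6)])
print(pts)
```

Plotting the two projected coordinates for both datasets, colored by dataset, gives the kind of overlap picture described above.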
I was also interested in the promiscuity of the compounds in the GSK dataset. Promiscuity is the phenomenon where a molecule shows activity in multiple assays. Promiscuous activity could indicate that the compound is truly active in all or most of the assays (i.e., hitting multiple distinct targets), but could also indicate that the activity is artifactual (such as if it were an aggregator or fluorescent compound).
This analysis is performed by looking for those GSK molecules that are in the NCGC collection (272 exact matches) and checking to see how many NCGC assays they are tested in and whether they were active or not. Rather than look at all assays in the NCGC collection, I consider a subset of approximately 1300 assays curated by a colleague. Ideally, a compound will be active in only one (or a few) of the assays it is tested in.
For simplicity’s sake, I just plot the number of assays a compound is tested in versus the number of them that it is active in. The plot is colored by the activity (pXC50 value in the GSK SD file) so that more potent molecules are lighter. While the bulk of these molecules do not show significant promiscuous activity, a few of them do lie at the upper range. I’ve annotated four and their structures are shown below. Compound 530674 appears to be quite promiscuous given that it is active in 46 out of 84 assays it’s been tested in at the NCGC. On the other hand, 22942 is tested in 232 assays but is active in 78 of them. This could be considered a low ratio, and isoquinolines have been noted to be non-promiscuous. (Both of these target kinases as noted in Gamo et al).
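The promiscuity measure used here is just the ratio of active to tested assays. For the two compounds above:

```python
def promiscuity_ratio(n_active, n_tested):
    """Fraction of the assays a compound was tested in
    that it shows activity in."""
    return n_active / n_tested

# counts quoted above for the two NCGC compounds
print(round(promiscuity_ratio(46, 84), 2))   # compound 530674
print(round(promiscuity_ratio(78, 232), 2))  # compound 22942
```

The first compound is active in over half the assays it was tested in, while the second sits around a third, which is why only the former stands out as promiscuous.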