Archive for the ‘HTS’ tag
So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing The R Book and came across survival models. Such models are used to characterize time to events, where an event could be the death of a patient, the failure of a part and so on. In these types of models the dependent variable is the number of time units that pass until the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.
It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How do molecular weight, lipophilicity etc. affect a compound’s “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays can be more or less indicative of a compound’s survival in a campaign. Survival models allow us to address these questions.
How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear, however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.
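As a minimal sketch of that mapping, the Python below assigns each compound a survival time equal to the number of consecutive assays it was tested in. The assay names, the compound identifiers and the choice to treat compounds that reach the final assay as censored are all assumptions made for illustration; they are not taken from a real campaign.

```python
# Hypothetical linear assay sequence; stage names are illustrative only.
ASSAY_ORDER = ["primary", "confirmation", "secondary", "adme", "in_vivo"]


def survival_record(assays_tested):
    """Return (time, event) for one compound.

    time  = number of consecutive assays, counted from the first one,
            that the compound was tested in
    event = 1 if the compound was dropped before the final assay ("death"),
            0 if it reached the final assay (treated as censored, an assumption)
    """
    time = 0
    for assay in ASSAY_ORDER:
        if assay not in assays_tested:
            break
        time += 1
    event = 0 if time == len(ASSAY_ORDER) else 1
    return time, event


# Hypothetical compounds and the assays each one was tested in.
compounds = {
    "cmpd_A": {"primary"},
    "cmpd_B": {"primary", "confirmation", "secondary"},
    "cmpd_C": set(ASSAY_ORDER),
}
records = {name: survival_record(tested) for name, tested in compounds.items()}
```

The resulting (time, event) pairs, joined with a descriptor table, are the shape of input that survival modeling routines generally expect.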
Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve characterizes the proportion of individuals alive at a certain time point. There are three types of survivorship curves: I, II and III, corresponding to scenarios where individuals have a higher risk of death at later times, a constant risk of death, and a higher risk of death at earlier times, respectively.
For the case of a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general, they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There are a variety of functions to choose from, but they require either that you know the error distribution or that you are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model (strictly speaking a semi-parametric model, since the baseline hazard is left unspecified while the covariate effects are parametric). I won’t go into the details of either approach, save to note that using a Cox model does not allow us to make predictions beyond the last time point whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last timepoint (i.e., the last assay) but are more interested in knowing what factors might affect survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.
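For reference, the standard form of the Cox proportional hazards model is sketched below. The symbols are my own notation, not taken from a specific analysis: t indexes the assay (time point), x_1 … x_p are descriptors of a compound (for example molecular weight or lipophilicity), and h_0(t) is the unspecified baseline hazard.

```latex
% Hazard of being discarded at assay t, for a compound with descriptors x
h(t \mid x) = h_0(t)\,\exp\left(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\right)

% Hazard ratio between two compounds a and b; the baseline hazard cancels
\frac{h(t \mid x^{(a)})}{h(t \mid x^{(b)})}
  = \exp\left(\sum_{j=1}^{p} \beta_j \left(x^{(a)}_j - x^{(b)}_j\right)\right)
```

Because the baseline hazard cancels in the ratio, a fitted coefficient β_j can be read as the change in log-hazard per unit change in descriptor j, which is the kind of statement about factors affecting survival that is of interest here.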
OK – it sounds cute, but has some obvious limitations:
- The use of a survival model assumes a linear timeline. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
- A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules is selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others, or there is some notion of “interesting” combined with a hunch that it will work out. Essentially chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that it is possible the model captures the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
- Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, the decline in the number of survivors is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2000 will be cherry picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.
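To get a feel for how steep that curve is, here is a small sketch that turns the rough counts from the last point into a survivorship curve and a per-stage attrition rate. The counts are the ballpark figures quoted above, with 10 used as a stand-in for "fewer than 10"; the stage labels are illustrative.

```python
# Ballpark compound counts entering each stage (10 stands in for "fewer than 10").
stage_counts = [
    ("primary", 400_000),
    ("confirmation", 2_000),
    ("secondary", 50),
    ("ADMET / in vivo", 10),
]


def survivorship(counts):
    """Fraction of the original compounds still alive at each stage."""
    n_start = counts[0][1]
    return [(name, n / n_start) for name, n in counts]


def stage_attrition(counts):
    """Fraction of compounds entering each stage that do not reach the next one."""
    return [
        (counts[i][0], 1 - counts[i + 1][1] / counts[i][1])
        for i in range(len(counts) - 1)
    ]


for name, fraction in survivorship(stage_counts):
    print(f"{name:>16}: {fraction:.6f} surviving")
for name, rate in stage_attrition(stage_counts):
    print(f"{name:>16}: {rate:.3f} discarded at this stage")
```

With these numbers the per-stage attrition falls from about 0.995 to 0.975 to 0.8, consistent with a Type III shape, but nearly all of the "deaths" happen at the very first time point.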
The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.
Some time back Baell et al. published an interesting paper describing a set of substructure filters to identify compounds that are promiscuous in high throughput biochemical screens. They termed these compounds Pan Assay Interference Compounds or PAINS. There are a variety of functional groups that are known to be problematic in HTS assays. The reasons for exclusion of molecules with these and other groups range from reactivity towards proteins to poor developmental potential or known toxicity. Derek Lowe has a nice summary of the paper.
The paper published the substructure filters as a collection of Sybyl Line Notation (SLN) patterns. Unfortunately, without access to Sybyl, it’s difficult to reuse the published patterns. Having them in SMARTS form would allow one to use them with many more (open source or commercial) tools. Luckily, Wolf Ihlenfeldt came to the rescue and provided me with access to a version of the CACTVS toolkit that was able to convert the SLN patterns to SMARTS.
There are three files, p_l15, p_l150 and p_m150 corresponding to tables S8, S7 and S6 from the supplementary information. The first column is the pattern and the second column is the name for that pattern taken from the original SLN files. While all patterns were converted to SMARTS, the conversion process is not perfect as I have not been able to reproduce (using the OEChem toolkit with the Tripos aromaticity model) all the hits that were obtained using the original SLN patterns.
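A minimal sketch of reading such a two-column file in Python follows. It assumes the columns are separated by whitespace, which I have not verified against the files themselves, and the two sample lines are made-up placeholders, not actual PAINS patterns.

```python
def parse_pattern_lines(lines):
    """Parse 'SMARTS name' lines into (smarts, name) pairs.

    Assumes whitespace-separated columns: the first token is the pattern,
    and everything after it is the pattern name. Blank lines are skipped.
    """
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        smarts = parts[0]
        name = parts[1].strip() if len(parts) > 1 else ""
        patterns.append((smarts, name))
    return patterns


# Placeholder lines for illustration only (not real PAINS filters).
sample = [
    "[#6]-[#8] example_pattern_1",
    "",
    "c1ccccc1 example_pattern_2",
]
parsed = parse_pattern_lines(sample)
```

For a real file one would pass an open file handle instead of the sample list, for example `parse_pattern_lines(open("p_l15"))`, and hand the SMARTS strings to whichever cheminformatics toolkit does the matching.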
(As a side note, the SMARTSViewer is a really handy tool to visualize a SMARTS pattern – which is great since many of the PAINS patterns are very complex)
A key feature of high throughput screening (HTS) efforts is automation. The NCGC is no stranger to automation, with two Kalypsys robots and a variety of automated components such as liquid handlers and so on. But while the screen itself is automated, the transitions between subsequent steps are not. Thus, after a screen is complete, I will be notified that the data is located in some directory. I’ll then load up the data, process it and end up with a set of compounds for follow-up. I’ll then send the list of compounds to be plated, which will then be screened in a follow-up assay.
In a number of situations, this approach is unavoidable as the data processing stage requires human intervention (plate corrections, switching controls, etc.). But in some situations, we can automate the whole process – primary screen, automated analysis & compound selection and secondary screen. Given that most screens at NCGC are dose response screens, we can refine an automated pipeline by processing individual plate series (i.e. a collection of plates representing a titration series) rather than waiting for all the plates to be completed. Another important point to note is that the different steps being considered here take different times. Thus screening a plate series might take 15 minutes, processing the resultant data and making selections would take 3 minutes and performing the secondary screen might take 10 minutes. Clearly the three steps have to proceed in the given order – but we don’t necessarily want to wait for each preceding step to be complete. In other words, we need the steps to proceed asynchronously, yet maintain temporal ordering.
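To make the benefit concrete, here is a back-of-the-envelope sketch using the example timings above (15, 3 and 10 minutes). It assumes each step handles one plate series at a time and that hand-offs are instantaneous; the number of plate series (10) is an arbitrary illustrative choice.

```python
STAGE_MINUTES = [15, 3, 10]  # primary screen, analysis/selection, secondary screen


def batch_makespan(n_series, stage_minutes):
    """Traditional workflow: finish a step for all series before starting the next."""
    return n_series * sum(stage_minutes)


def pipelined_makespan(n_series, stage_minutes):
    """Asynchronous workflow: a series moves on as soon as the next step is free.

    finish[j] holds the time at which step j finished its most recent series.
    """
    finish = [0] * len(stage_minutes)
    for _ in range(n_series):
        ready = 0  # time at which this series is available to the current step
        for j, minutes in enumerate(stage_minutes):
            start = max(ready, finish[j])  # wait for the data and for the step to be free
            finish[j] = start + minutes
            ready = finish[j]
    return finish[-1]


n = 10
print("batch:    ", batch_makespan(n, STAGE_MINUTES), "minutes")
print("pipelined:", pipelined_makespan(n, STAGE_MINUTES), "minutes")
```

Under these assumptions the primary screen is the bottleneck, so the pipelined total is roughly the primary screening time plus one pass through the two downstream steps, while the temporal ordering of the steps for each plate series is preserved.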
One approach to automating such a process is the use of a message queue (MQ). The fundamental idea behind an MQ is that one creates a queue on some machine and then starts one or more processes (likely on some other machines) to send messages to the queue. These messages can then be retrieved by one or more listener processes. MQ systems provide a number of useful features beyond the core functionality of storing and distributing messages – these include message persistence, security policy, routing, batching and so on.
In our case, when a plate series is screened, the robot sends a message to the queue. Some process will be listening to the queue and when it sees a message, pulls it off the queue and processes the data from the screen for that plate series. Once processing is complete, the process sends another message to the queue (or another queue) from which yet another process (this one running on another robot) can pull it off and start the secondary screen on the selected compounds. Thus, as soon as a plate series is finished in the primary screen, we can start the processing and follow up, while the next plate series gets started. A message queue approach is also useful since messages can remain on the queue until the appropriate listener pulls them off for processing. A good queue system will ensure that such messages are delivered reliably and don’t get lost.
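The flow just described can be sketched with Python's standard library, using in-process queues and threads as a stand-in for a real message queue system. This is only an illustration of the pattern: there is no persistence or cross-machine delivery here, and the message fields, function names and "selection" logic are placeholders I made up.

```python
import queue
import threading

STOP = object()  # sentinel message signalling that no more plate series will arrive


def primary_screen(series_ids, screened_q):
    """Stand-in for the robot: announce each plate series as it finishes."""
    for series_id in series_ids:
        screened_q.put({"series": series_id})
    screened_q.put(STOP)


def analysis(screened_q, selected_q):
    """Listener: process each screened series and publish the selected compounds."""
    while True:
        msg = screened_q.get()
        if msg is STOP:
            selected_q.put(STOP)
            break
        # Placeholder selection; real code would load and analyze the plate data.
        picks = [f"{msg['series']}-cmpd{i}" for i in range(2)]
        selected_q.put({"series": msg["series"], "selected": picks})


def secondary_screen(selected_q, log):
    """Stand-in for the second robot: start follow-up on each selection."""
    while True:
        msg = selected_q.get()
        if msg is STOP:
            break
        log.append(msg["series"])


def run_pipeline(series_ids):
    screened_q, selected_q, log = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=primary_screen, args=(series_ids, screened_q)),
        threading.Thread(target=analysis, args=(screened_q, selected_q)),
        threading.Thread(target=secondary_screen, args=(selected_q, log)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return log


followed_up = run_pipeline(["series-1", "series-2", "series-3"])
```

Each stage runs independently and blocks only while its input queue is empty, so the steps proceed asynchronously while the FIFO queues keep the plate series in order.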
The diagram below highlights this approach. The solid lines represent the traditional workflow. Given that we’d manually process the screening data, we’d wait till all plate series are run. The dashed lines represent a message based workflow, in which we can process each plate series independently.
In the next few posts I’ll describe such a message queue based workflow that I’ve been working on these past few days. Currently it’s specific to a screen that we’re going to be running. The infrastructure is written in Java and makes use of Oracle Advanced Queuing (AQ) to provide message queues and the facilities for receiving and sending messages. I’ll describe a minimal implementation that makes use of the Java Message Service (JMS) and the standard JMS message types and then follow on with an example using a custom message type that maps to an Oracle user-defined type, allowing for more “object oriented” messages.
My previous post did a quick comparison of the GSK anti-malarial screening dataset with a virtual library of Ugi products. That comparison was based on the PubChem fingerprints and indicated a broad degree of overlap. I was also interested in looking at the overlap in other feature spaces. The simplest way to do this is to evaluate a set of descriptors and then perform a principal components analysis. We can then plot the first two principal components to get an idea of the distribution of the compounds in the defined space.
I evaluated a number of descriptors using the CDK. In a physicochemical space represented by the number of rotatable bonds, molecular weight and XlogP values, a plot of the first two principal components looks as shown on the right. Given the large number of points, the plot is more of a blob, but does highlight the fact that there is a good degree of overlap between the two datasets. On going to a BCUT space on the left, we get a different picture, stressing the greater diversity of the GSK dataset. Of course, these are arbitrary descriptor spaces and not necessarily meaningful. One would probably choose a descriptor space based on the problem at hand (and also the CDK XlogP implementation probably needs some work).
I was also interested in the promiscuity of the compounds in the GSK dataset. Promiscuity is the phenomenon where a molecule shows activity in multiple assays. Promiscuous activity could indicate that the compound is truly active in all or most of the assays (i.e., hitting multiple distinct targets), but could also indicate that the activity is artifactual (such as if it were an aggregator or fluorescent compound).
This analysis is performed by looking for those GSK molecules that are in the NCGC collection (272 exact matches) and checking to see how many NCGC assays they are tested in and whether they were active or not. Rather than look at all assays in the NCGC collection, I consider a subset of approximately 1300 assays curated by a colleague. Ideally, a compound will be active in only one (or a few) of the assays it is tested in.
For simplicity’s sake, I just plot the number of assays a compound is tested in versus the number of them that it is active in. The plot is colored by the activity (pXC50 value in the GSK SD file) so that more potent molecules are lighter. While the bulk of these molecules do not show significant promiscuous activity, a few of them do lie at the upper range. I’ve annotated four and their structures are shown below. Compound 530674 appears to be quite promiscuous given that it is active in 46 out of 84 assays it’s been tested in at the NCGC. On the other hand, 22942 is tested in 232 assays but is active in only 78 of them. This could be considered a low ratio, and isoquinolines have been noted to be non-promiscuous. (Both of these target kinases as noted in Gamo et al.).
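The ratio being eyeballed here is simply the fraction of tested assays in which a compound is active. A tiny sketch, using only the two compounds and counts quoted above (the function name is my own):

```python
# (number of assays active in, number of assays tested in), as quoted in the text
assay_counts = {
    "530674": (46, 84),
    "22942": (78, 232),
}


def active_fraction(n_active, n_tested):
    """Fraction of tested assays in which the compound was active."""
    if n_tested <= 0:
        raise ValueError("compound must be tested in at least one assay")
    return n_active / n_tested


fractions = {cid: active_fraction(*counts) for cid, counts in assay_counts.items()}
for cid, frac in fractions.items():
    print(f"{cid}: active in {frac:.0%} of assays tested")
```

This works out to roughly 55% for 530674 and 34% for 22942. Where to draw the line between the two is a judgment call, and a raw fraction says nothing about whether the activity is genuine or artifactual.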
A few days ago, GSK released an approximately 13,000 member compound library (using the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL and a paper is available that describes the screening project and results. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.
There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.
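As a minimal sketch of the idea (in plain Python rather than the R tooling), a bit spectrum can be computed by counting how often each bit position is set and dividing by the number of molecules. Here each fingerprint is represented as a collection of "on" bit positions, which is an assumption about the input format, and the toy data is made up.

```python
def bit_spectrum(fingerprints, n_bits):
    """Normalized frequency of each bit position across a set of fingerprints.

    fingerprints: list of iterables, each holding the 0-based positions of set bits
    n_bits:       length of the fingerprint
    """
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    counts = [0] * n_bits
    for fp in fingerprints:
        for pos in set(fp):  # guard against a position being listed twice
            counts[pos] += 1
    n = len(fingerprints)
    return [c / n for c in counts]


# Toy 8-bit fingerprints for four hypothetical molecules.
toy_fps = [{0, 3}, {0, 5}, {0, 3, 7}, {3}]
spectrum = bit_spectrum(toy_fps, 8)
```

Each value lies between 0 and 1, so spectra from datasets of very different sizes can be plotted on the same axes.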
I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However, there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-membered rings. Another region of difference occurs around bit position 300 and then around positions 350-375.
The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.
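A difference spectrum of this kind is just an element-wise subtraction of two bit spectra. The sketch below follows the sign convention mentioned above (Ugi minus GSK, so positive values mean a feature is more prevalent in the Ugi library) and picks out the bit positions with the largest absolute differences. The input values are toy numbers, not the real spectra, and the function names are my own.

```python
def difference_spectrum(spectrum_a, spectrum_b):
    """Element-wise difference a - b between two bit spectra of equal length."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must have the same number of bits")
    return [a - b for a, b in zip(spectrum_a, spectrum_b)]


def largest_differences(diff, k):
    """Indices of the k bit positions with the largest absolute difference."""
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return order[:k]


# Toy spectra (normalized bit frequencies) for two hypothetical libraries.
ugi_spectrum = [0.50, 0.10, 0.90, 0.00]
gsk_spectrum = [0.50, 0.60, 0.80, 0.25]

diff = difference_spectrum(ugi_spectrum, gsk_spectrum)  # positive: more common in Ugi
top_bits = largest_differences(diff, 2)
```

Mapping the top-ranked bit positions back to the fingerprint's feature definitions would then show which substructures distinguish the two libraries.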