Archive for the ‘data mining’ tag
So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing the The R Book and came across survival models. Such models are used to characterize time to events, where an event could be death of a patient or failure of a part and so on. In these types of models the dependent variable is the number of time units that pass till the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.
It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons for why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How does molecular weight, lipophilicity etc. affect a compounds “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays can be more or less indicative of a compounds survival in a campaign. Survival models allow us to address these questions.
How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.
Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve characterizes the proportion of individuals alive at a certain time point. There are three types of survivorship curves: I, II and III corresponding to scenarios where individuals have a higher risk of death at later times, a constant risk of death and individuals have a higher risk of death at earlier times, respectively.
For the case of the a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general, they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There a variety of functions to choose from, but either require that you know the error distribution or else are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model. I won’t go into the details of either approach, save to note that using a Cox model does not allow us to make predictions beyond the last time point whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last timepoint (i.e., the last assay) but are more interested in knowing what factors might affect survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.
OK – it sounds cute, but has some obvious limitations
- The use of a survival model assumes a linear time line. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
- A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules are selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others or some notion of interesting combined with a hunch that it will work out. Essentially chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that it is possible the model captures the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
- Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, decrease in death rate is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2000 will be cherry picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.
The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.
Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need for this for many years. A few years back I described a system to deploy R models and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a., descriptors). Depending on the licenses associated with descriptor software such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform which looks quite nice, but only supports R models (not necessarily a bad thing!). A very recent candidate for generic predictive model (amongst other things) deployment is via plugins for the BARD platform.
But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specifications and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
While at the ACS National Meeting in Philadelphia I attended a talk by David Thompson of Boehringer Ingelheim (BI), where he spoke about a recent competition BI sponsored on Kaggle – a web site that hosts data mining competitions. In this instance, BI provided a dataset that contained only object identifiers and about 1700 numerical features and a binary dependent variable. The contest was open to anybody and who ever got the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition and also on Davids’ slides.
But I’m curious about the utility of such a competition. During the competition, all contestents had access to were the numerical features. So the contestants had no idea of the domain from where the data came – placing the onus on pure modeling ability and no need for domain knowledge. But in fact the dataset provided to them, as announced by David at the ACS, was the Hansen AMES mutagenicity dataset characterized using a collection of 2D descriptors (continuous topological descriptors as well as binary fingerprints).
BI included some “default” models and the winning models certainly performed better (10% for the winning model). This is not surprising, as they did not attempt build optimized models. But then we also see that the top 5 models differed only incrementally in their log loss values. Thus any one of the top 3 or 4 models could be regarded as a winner in terms of actual predictions.
What I’d really like to know is how well such an approach leads to better chemistry or biology. First, it’s clear that such an approach leads to the optimization of pure predictive performance and cannot provide insight into why the model makes an active or inactive call. In many scenario’s this is sufficient, but more often than not, domain specific diagnostics are invaluable. Second, how does the relative increase in model performance lead to better decision making? Granted, the crowd-sourced, gamified approach is a nice way to eke out the last bits of predictive performance on a dataset – but does it really matter that one model performs 1% better than the next best model? The fact that the winning model was 10% better than the “default” BI model is not too informative. So a specific qustion I have is, was there a benefit, in terms of model performance, and downstream decision making by asking the crowd for a better model, compared to what BI had developed using (implicit or explicit) chemical knowledge?
My motivation is to try and understand whether the winning model was an incremental improvement or whether it was a significant jump, not just in terms of numerical performance, but in terms of the predicted chemistry/biology. People have been making noises of how data trumps knowledge (or rather hypotheses and models) and I believe that in some cases this can be true. But I also wonder to what extent this holds for chemical data mining.
But it’s equally important to understand what such a model is to be used for. In a virtual screening scenario, one could probably ignore interpretability and go for pure predictive performance. In such cases, for increasingly large libraries, it might make sense for one to have a model that s 1% better than the state of the art. (In fact, there was a very interesting talk by Nigel Duffy of Numerate, where he spoke about a closed form, analytical expression for the hit rate in a virtual screen, which indicates that for improvements in the overall performance of a VS workflow, the best investment is to increase the accuracy of the predictive model. Indeed, his results seem to indicate that even incremental improvements in model accuracy lead to a decent boost to the hit rate).
I want to stress that I’m not claiming that BI (or any other organization involved in this type of activity) has the absolute best models and that nobody can do better. I firmly believe that however good you are at something, there’s likely to be someone better at it (after all, there are 6 billion people in the world). But I’d also like to know how and whether incrementally better models do when put to the test of real, prospective predictions.
I’ve put released an update to rcdk and rcdklibs on CRAN – right now source packages are available, but binary ones should show up soon. Both packages should be updated together. These packages integrate the CDK into the R environment and simplifies a number of cheminformatics tasks. These versions used CDK 1.3.6 and JCP 16, so we now get access to SMSD and a few new descriptors.. In addition a some new methods have been included
- cdk.version to get the version of the CDK being used by the package
- is.subgraph uses SMSD to identify substructures. Similar to the pre-exisiting matches method, but much faster in general (though you cannot specify SMARTS)
- get.mcs again, wraps SMSD and allows one to retrieve the MCS (as a molecule object or as atom indices) for a pair of molecules. Once again, should be pretty fast