Archive for the ‘qsar’ tag
Deep learning (DL) is all the rage these days and this approach to predictive modeling is being applied to a wide variety of problems, including many in computational drug discovery. As a dilettante in the area of deep learning, I’ve been following papers that have used DL for cheminformatics problems, and thought I’d mention a few that seemed interesting.
An obvious outcome of a DL model is more accurate predictions, and as a result most applications of DL in drug discovery have focused on the use of DL models as more accurate regression or classification models. Examples include Lusci et al, Xu et al and Ma et al. It’s interesting to note that in these papers, while DL models show better performance, the improvement is not consistent and the actual increase in performance is not necessarily very large (for the effort required). Ekins has reviewed the use of DL models in QSAR settings and more recently Winkler & Le have also briefly reviewed this area.
However, simply replacing one regression method with another is not particularly interesting. Indeed, as pointed out by several workers (e.g., Shao et al), input descriptors, rather than the modeling method, have a greater effect on predictive accuracy. And so it’s in the topic of representation learning that I think DL methods become interesting and useful in the area of cheminformatics.
Several groups have published work on using DL methods to learn a representation of the molecular structure, directly from the graph representation. Duvenaud et al and Kearnes et al have both described these approaches and the nice thing is that this alleviates the need to choose and select features a priori. The downside is that the learned features are optimal in the context of the training data (thus necessitating large training sets to allow for learned features that are generalizable). Interestingly, on reading Kearnes et al, the features that are learned by the DL model are conceptually similar to circular fingerprints. More interestingly, when they built predictive neural network models using the learned representation, the RMSE was not significantly different from a random forest model using circular fingerprints. Of course, the learned representation is driven by the architecture of the DL model, which was designed to look at atom neighborhoods, so it’s probably not too surprising that the optimal representation was essentially equivalent to a circular fingerprint. But one can expect that tweaking the DL architecture and going beyond the molecular graph could lead to more useful representations. Also, this paper very clearly describes the hows and whys of designing a deep neural network architecture, and is useful for someone interested in exploring further.
Another interesting development is the use of DL to learn a continuous representation of a molecular structure that can then be modified (usually in a manner to vary some molecular property) and “decoded” to obtain a new chemical structure with the desired molecular property. This falls into the class of inverse QSAR problems and Gomez-Bombarelli et al present a nice example of this approach, where gradient descent is used to explore the chemical space defined by the learned continuous representation. Unfortunately the chemistry represented by the generated structures has several problems as described by Derek Lowe. While this problem has been addressed before (e.g., Wong et al with SVM, Miyao et al, Skvortsova et al), these efforts have started with pre-defined feature sets. The current work’s key contribution is the ability to generate a continuous chemical space and I assume the nonsensical regions of the space could be avoided using appropriate filters.
Winkler & Le recently reported a comparison of deep and shallow neural networks for QSAR regression. Their results and conclusions are similar to previous work. But more tantalizingly, they make the claim that DNNs may be better suited to tackle the prediction of activity cliffs. There has been some work on this topic (Guha and Heikamp et al) but given that activity cliffs are essentially discontinuities in a SAR surface (either fundamentally or by choice of descriptors), traditional predictive models are unlikely to do well. Winkler & Le point to work that suggests that activity cliffs may “disappear” if a descriptor space of appropriately high dimensionality is used, and conclude that learned representations via DL may be useful for this. Though I don’t discount this, I’m not convinced that simply moving to higher dimensional spaces is sufficient (or even necessary) – if it were, SVMs should be good at predicting activity cliffs. Rather, it’s the correct set of features, one that captures the phenomenon underlying the cliff, that is necessary. Nonetheless, Winkler & Le raise some interesting questions regarding the smoothness of chemical spaces.
The topic of algorithmic fairness has started receiving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.
I recently came across a paper from Adler et al on the topic of algorithmic fairness. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, these approaches all involve retraining the model with scrambled or randomized descriptors.
The core of Adler et al is their statement that
the information content of a feature can be estimated by trying to predict it from the remaining features.
Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.
The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.
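As a rough sketch of what the simpler scrambling alternative might look like under the same black box assumption (this is not the procedure from Adler et al), the snippet below shuffles one descriptor column at a time in the test set and records the drop in accuracy, calling the model only through its prediction function. The names `predict`, `scrambling_influence` and `toy_model`, along with the toy data, are hypothetical and only for illustration.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the black-box `predict` returns the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def scrambling_influence(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each descriptor column is shuffled in turn.

    `predict` is only ever called, never refit, so the model stays a black box.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    influence = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j across rows, leaving all other columns untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_scrambled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_scrambled, y))
        influence.append(sum(drops) / n_repeats)
    return influence


# Toy usage: a hypothetical "model" that only looks at the first descriptor.
def toy_model(row):
    return int(row[0] > 0.5)


rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
influence = scrambling_influence(toy_model, X, y)
```

In this toy case the second descriptor is ignored by the model, so scrambling it costs nothing, while scrambling the first descriptor should cost roughly half the accuracy.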
Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature \(X_i\) and regress it on all the other features \(X_j\ (j \neq i)\). Repeating this for all \(X_i\) would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are successfully estimated from the remaining features. This would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).
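A minimal sketch of this ranking, assuming numerical descriptors on comparable scales (otherwise the RMSE values are not comparable across features and the columns should be standardized first). The choice of a leave-one-out k-nearest-neighbor regressor is mine, picked only because it can pick up non-linear relationships without any dependencies. The function names are hypothetical.

```python
import math
import random


def loo_knn_rmse(X, target, k=3):
    """Leave-one-out RMSE for predicting column `target` from the other columns
    with a k-nearest-neighbor regressor."""
    others = [[v for j, v in enumerate(row) if j != target] for row in X]
    values = [row[target] for row in X]
    squared_error = 0.0
    for i, query in enumerate(others):
        # Distances from row i to every other row, in the reduced descriptor space.
        neighbors = sorted(
            (math.dist(query, other), values[m])
            for m, other in enumerate(others)
            if m != i
        )[:k]
        prediction = sum(v for _, v in neighbors) / len(neighbors)
        squared_error += (prediction - values[i]) ** 2
    return math.sqrt(squared_error / len(X))


def rank_by_predictability(X, k=3):
    """Return (feature indices sorted by increasing RMSE, list of per-feature RMSE)."""
    rmse = [loo_knn_rmse(X, j, k) for j in range(len(X[0]))]
    return sorted(range(len(rmse)), key=rmse.__getitem__), rmse


# Toy data: the second feature is a non-linear function of the first,
# the third is independent noise.
rng = random.Random(0)
X = []
for _ in range(150):
    x0 = rng.random()
    X.append([x0, x0 ** 2, rng.random()])

order, rmse = rank_by_predictability(X)
```

Here the independent third feature should come last in `order`, since it cannot be estimated from the other two.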
Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?
I came across a recent paper from the Tropsha group that discusses the issue of modelability – that is, can a dataset (represented as a set of computed descriptors and an experimental endpoint) be reliably modeled. Obviously the definition of reliable is key here and the authors focus on a cross-validated classification accuracy as the measure of reliability. Furthermore they focus on binary classification. This leads to a simple definition of modelability – for each data point, identify whether its nearest neighbor is in the same class as the data point. Then, the ratio of the number of observations whose nearest neighbor is in the same activity class to the number of observations in that activity class, summed over all classes, gives the MODI score. Essentially this is a statement on linear separability within a given representation.
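To make the definition concrete, here is a small brute-force sketch of a MODI-style score. Two things are assumptions on my part: Euclidean distance as the neighbor metric, and dividing the summed per-class ratios by the number of classes so that the score stays between 0 and 1. The function name `modi` is just a label for this sketch.

```python
import math


def modi(X, labels):
    """Per-class fraction of points whose nearest neighbor shares their class,
    averaged over the classes (brute-force, O(n^2) distance evaluations)."""
    same = {}
    total = {}
    for i, (xi, yi) in enumerate(zip(X, labels)):
        # Index of the nearest other point under Euclidean distance.
        nearest = min(
            (m for m in range(len(X)) if m != i),
            key=lambda m: math.dist(xi, X[m]),
        )
        total[yi] = total.get(yi, 0) + 1
        same[yi] = same.get(yi, 0) + (labels[nearest] == yi)
    return sum(same[c] / total[c] for c in total) / len(total)


# Two well-separated clusters: every nearest neighbor is in the same class.
separated = modi([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]], [0, 0, 0, 1, 1, 1])

# Classes alternating along a line: every nearest neighbor is in the other class.
interleaved = modi([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]], [0, 1, 0, 1, 0, 1])
```

The two toy datasets sit at the extremes of the score, the first being trivially modelable in this representation and the second not at all.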
The authors then go on to show a pretty good correlation between the MODI scores over a number of datasets and their classification accuracy. But this leads to the question – if one has a dataset and associated modeling tools, why compute the MODI? The authors state
we suggest that MODI is a simple characteristic that can be easily computed for any dataset at the onset of any QSAR investigation
I’m not being rigorous here, but I suspect for smaller datasets the time requirements for MODI calculations are pretty similar to those for building the models themselves and for very large datasets MODI calculations may take longer (due to the requirement of a distance matrix calculation – though this could be alleviated using approximate nearest neighbor (ANN) methods or locality-sensitive hashing (LSH)). In other words – just build the model!
Another issue is the relation between MODI and SVM classification accuracy. The key feature of SVMs is that they apply the kernel trick to transform the input dataset into a higher dimensional space that (hopefully) allows for better separability. As a result MODI calculated on the input dataset should not necessarily be related to the transformed dataset that is actually operated on by the SVM. In other words a dataset with poor MODI could be well modeled by an SVM using an appropriate kernel.
The paper, by definition, doesn’t say anything about what model would be best for a given dataset. Furthermore, it’s important to realize that every dataset can be perfectly predicted using a sufficiently complex model. This is also known as an overfit model. The MODI approach to modelability avoids this by considering a cross-validated accuracy measure.
One application of MODI that does come to mind is for feature selection – identify a descriptor subset that leads to a predictive model. This is justified by the observed correlation between the MODI scores and the observed classification rates and would avoid having to test feature subsets with the modeling algorithm itself. An alternative application (as pointed out by the authors) is to identify subsets of the data that exhibit a good MODI score, thus leading to a local QSAR model.
More generally, it would be interesting to extend the concept to regression models. Intuitively, a dataset that is continuous in a given representation should have better modelability than one that is discontinuous. This is exactly the scenario that can be captured using the activity landscape approach. Some time back I looked at characterizing the roughness of an activity landscape using SALI and applied it to the feature selection problem – being able to correlate such a measure to the predictive accuracy of models built on those datasets could allow one to address modelability (and more specifically, what level of continuity a landscape should present to be modelable) in general.
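For reference, SALI for a pair of molecules is, as I recall the usual definition, the absolute activity difference divided by one minus their structural similarity, so that similar molecules with very different activities score highly. A minimal sketch follows, assuming fingerprints are represented as Python sets of "on" bit positions and that Tanimoto similarity is used. The helper names and the toy data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sali_matrix(fingerprints, activities):
    """Symmetric matrix of pairwise SALI values: |A_i - A_j| / (1 - sim(i, j))."""
    n = len(activities)
    sali = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = tanimoto(fingerprints[i], fingerprints[j])
            diff = abs(activities[i] - activities[j])
            if sim < 1.0:
                value = diff / (1.0 - sim)
            else:
                # Identical fingerprints: the denominator is zero, so report an
                # infinite cliff if the activities differ and zero otherwise.
                value = float("inf") if diff else 0.0
            sali[i][j] = sali[j][i] = value
    return sali


# Toy data: molecules 0 and 1 are similar but differ strongly in activity,
# molecule 2 is dissimilar to 0 and has nearly the same activity.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
activities = [5.0, 9.0, 5.2]
sali = sali_matrix(fingerprints, activities)
```

A summary of how many large values such a matrix contains is one way to express the roughness of the landscape for a given representation.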
Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need of this for many years. A few years back I described a system to deploy R models and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a., descriptors). Depending on the licenses associated with descriptor software such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform which looks quite nice, but only supports R models (not necessarily a bad thing!). A very recent candidate for deploying generic predictive models (amongst other things) is via plugins for the BARD platform.
But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could be implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specifications and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
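To illustrate the decoupling idea in the smallest possible way, the sketch below keeps a model definition as plain data and evaluates it with a generic function. The JSON layout, descriptor names and coefficients are entirely made up for this example and are not PMML or QSAR-ML. A real specification would also have to pin down how each descriptor is computed.

```python
import json
import math

# A hypothetical, ad hoc model specification (not PMML or QSAR-ML).
SPEC = json.loads("""
{
  "model": "logistic_regression",
  "descriptors": ["logp", "mol_weight"],
  "intercept": -1.0,
  "coefficients": {"logp": 0.8, "mol_weight": 0.002}
}
""")


def evaluate(spec, descriptor_values):
    """Score one molecule with the model described by `spec`.

    `descriptor_values` maps descriptor names to precomputed numeric values.
    """
    if spec["model"] != "logistic_regression":
        raise ValueError(f"unsupported model type: {spec['model']}")
    z = spec["intercept"] + sum(
        spec["coefficients"][name] * descriptor_values[name]
        for name in spec["descriptors"]
    )
    # Logistic link: map the linear score to a probability.
    return 1.0 / (1.0 + math.exp(-z))


probability = evaluate(SPEC, {"logp": 0.0, "mol_weight": 500.0})
```

Any other language could implement the same `evaluate` logic against the same specification, which is the point of separating the two.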
Gamification is a hot topic and companies such as Tunedit and Kaggle are successfully hosting a variety of data mining competitions. These competitions employ data from a variety of domains such as bond trading, essay scoring and so on. Recently, both platforms have hosted a QSAR challenge (though not officially denoted as such). The most recent one is the challenge hosted at Kaggle by Boehringer Ingelheim.
While it’s good to see these competitions raise the profile of “data science” (and make some money for the winners), I must admit that these are not particularly interesting to me as it really boils down to looking at numbers with no context (aka domain knowledge). For example, in the Kaggle & BI example, there are 1,776 descriptors that have been normalized but no indication of the chemistry or biology. One could ask whether a certain mechanism of action is known to play a role in the biology being tested which could suggest a certain class of descriptors over another. Alternatively, one could ask whether there are a few distinct chemotypes present thus suggesting multiple local models versus a single global model. (I suppose that the supplied descriptors may lend themselves to a clustering, but a scaffold based approach would be much more direct and chemically intuitive).
This is not to say that such competitions are useless. On the contrary, lack of domain knowledge doesn’t preclude one from applying sophisticated statistical and machine learning methods to unannotated data and obtaining impressive results. The issue of data versus domain knowledge has been discussed in several places.
In contrast to the currently hosted challenge at Kaggle, an interesting twist would be to try and reverse engineer the structures from their descriptor values. There have been some previous discussions on reverse engineering structures from descriptor data. Obviously, we’re not going to be able to verify our results, but it would be an interesting challenge.