Archive for the ‘qsar’ tag
The topic of algorithmic fairness has started recieving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.
Thus I recently came across a paper from Adler et al on the topic of algorithmic fairness. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, all of them involve retraining the model with scrambled or randomized descriptors.
The core of Adler et al is their statement that
the information content of a feature can be estimated by trying to predict it from the remaining features.
Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.
The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.
Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature \(X_i\) and regress all the other features \(X_j\ (j \neq i)\) on it. Repeating this for all \(X_i\) would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are succesfully estimated from the remaining features. This would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).
Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?
I came across a recent paper from the Tropsha group that discusses the issue of modelability – that is, can a dataset (represented as a set of computed descriptors and an experimental endpoint) be reliably modeled. Obviously the definition of reliable is key here and the authors focus on a cross-validated classification accuracy as the measure of reliability. Furthermore they focus on binary classification. This leads to a simple definition of modelability – for each data point, identify whether it’s nearest neighbor is in the same class as the data point. Then, the ratio of number of observations whose nearest neighbor is in the same activity class to the number observations in that activity class, summed over all classes gives the MODI score. Essentially this is a statement on linear separability within a given representation.
The authors then go show a pretty good correlation between the MODI scores over a number of datasets and their classification accuracy. But this leads to the question – if one has a dataset and associated modeling tools, why compute the MODI? The authors state
we suggest that MODI is a simple characteristic that can be easily computed for any dataset at the onset of any QSAR investigation
I’m not being rigorous here, but I suspect for smaller datasets the time requirements for MODI calculations is pretty similar to building the models themselves and for very large datasets MODI calculations may take longer (due to the requirement of a distance matrix calculation – though this could be alleviated using ANN or LSH). In other words – just build the model!
Another issue is the relation between MODI and SVM classification accuracy. The key feature of SVMs is that they apply the kernel trick to transform the input dataset into a higher dimensional space that (hopefully) allows for better separability. As a result MODI calculated on the input dataset should not necessarily be related to the transformed dataset that is actually operated on by the SVM. In other words a dataset with poor MODI could be well modeled by an SVM using an appropriate kernel.
The paper, by definition, doesn’t say anything about what model would be best for a given dataset. Furthermore, it’s important to realize that every dataset can be perfectly predicted using a sufficiently complex model. This is also known as an overfit model. The MODI approach to modelability avoids this by considering a cross-validated accuracy measure.
One application of MODI that does come to mind is for feature selection – identify a descriptor subset that leads to a predictive model. This is justified by the observed correlation between the MODI scores and the observed classification rates and would avoid having to test feature subsets with the modeling algorithm itself. An alternative application (as pointed out by the authors) is to identify subsets of the data that exhibit a good MODI score, thus leading to a local QSAR model.
More generally, it would be interesting to extend the concept to regression models. Intuitively, a dataset that is continuous in a given representation should have a better modelability than one that is discontinuous. This is exactly the scenario that can be captured using the activity landscape approach. Sometime back I looked at characterizing the roughness of an activity landscape using SALI and applied it to the feature selection problem – being able to correlate such a measure to predictive accuracy of models built on those datasets could allow one to address modelability (and more specifically, what level of continuity should a landscape present to be modelable) in general.
Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need for this for many years. A few years back I described a system to deploy R models and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a., descriptors). Depending on the licenses associated with descriptor software such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform which looks quite nice, but only supports R models (not necessarily a bad thing!). A very recent candidate for generic predictive model (amongst other things) deployment is via plugins for the BARD platform.
But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specifications and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
Gamification is a hot topic and companies such as Tunedit and Kaggle are succesfully hosting a variety of data mining competitions. These competitions employ data from a variety of domains such as bond trading, essay scoring and so on. Recently, both platforms have hosted a QSAR challenge (though not officially denoted as such). The most recent one is the challenge hosted at Kaggle by Boehringer Ingelheim.
While it’s good to see these competitions raise the profile of “data science” (and make some money for the winners), I must admit that these are not particularly interesting to me as it really boils down to looking at numbers with no context (aka domain knowledge). For example, in the Kaggle & BI example, there are 1,776 descriptors that have been normalized but no indication of the chemistry or biology. One could ask whether a certain mechanism of action is known to play a role in the biology being tested which could suggest a certain class of descriptors over another. Alternatively, one could ask whether there are a few distinct chemotypes present thus suggesting multiple local models versus a single global model. (I suppose that the supplied descriptors may lend themselves to a clustering, but a scaffold based approach would be much more direct and chemically intuitive).
This is not to say that such competitions are useless. On the contrary, lack of domain knowledge doesn’t preclude one from apply sophisticated statistical and machine learning methods to unannotated data and obtaining impressive results. The issue of data versus domain knowledge has been discussed in several places.
In contrast to the currently hosted challenge at Kaggle, an interesting twist would be to try and reverse engineer the structures from their descriptor values. There have been some previous discussions on reverse engineering structures from descriptor data. Obviously, we’re not going to be able to verify our results, but it would be an interesting challenge.
Sometime back Egon implemented a simple group contribution based volume calculator and it made its way into the stable branch (1.4.x) today. As a result I put out a new version of the CDKDescUI which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. The volume descriptor (based on the VABCVolume class) is one that has been missing for the some time (though the NumericalSurface class did return a volume, but it’s slow). This class is reasonably fast (10,000 molecules processed in 32 sec) and correlates well with the 2D and pseudo-3D volume descriptors from MOE (2008.10) as shown below. As expected the correlation is better with the 2D version of the descriptor (which is similar in nature to the lookup method used in the CDK version). The X-axis represents the CDK descriptor values.