Archive for the ‘descriptor’ tag
The topic of algorithmic fairness has started recieving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.
Thus I recently came across a paper from Adler et al on the topic of algorithmic fairness. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, all of them involve retraining the model with scrambled or randomized descriptors.
The core of Adler et al is their statement that
the information content of a feature can be estimated by trying to predict it from the remaining features.
Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.
The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.
Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature \(X_i\) and regress all the other features \(X_j\ (j \neq i)\) on it. Repeating this for all \(X_i\) would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are succesfully estimated from the remaining features. This would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).
Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?
Sometime back Egon implemented a simple group contribution based volume calculator and it made its way into the stable branch (1.4.x) today. As a result I put out a new version of the CDKDescUI which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. The volume descriptor (based on the VABCVolume class) is one that has been missing for the some time (though the NumericalSurface class did return a volume, but it’s slow). This class is reasonably fast (10,000 molecules processed in 32 sec) and correlates well with the 2D and pseudo-3D volume descriptors from MOE (2008.10) as shown below. As expected the correlation is better with the 2D version of the descriptor (which is similar in nature to the lookup method used in the CDK version). The X-axis represents the CDK descriptor values.
A few days back, Aaron posted a question regarding the use of the tree width of a graph (intuitively, a measure of how tree like a graph is) in a chemical context. The paper that he pointed to was not very informative in terms of chemical applications. The discussion thread then expanded to asking about the utility of this descriptor – could it be used in a QSAR context as a descriptor of molecular structure? Or is it more suitable in a “filtering” scenario, since as Aaron pointed out “Some NP-complete problems become tractable when a graph has bounded treewidth … ” (with graph isomorphism given as an example).
I took a look at the first question – is it a useful descriptor? Yamaguchi et al, seems to indicate that this is a very degenerate descriptor (i.e., different structures give you the same value of the tree width). Luckily, someone had already done the hard work of implementing a variety of algorithms to evaluate tree widths. libtw is a Java library that provides a handy framework to experiment with tree width algorithms. I implemented a simple adapter to convert CDK molecule objects into the graph data structure used by libtw and a driver to process a SMILES file and report the tree width values as well as execution times. While libtw provides a number of tree width algorithms I just used a single one (arbitrarily). The code is available on Github and requires the CDK and libtw jar files to compile and run.
I took a random sample of 10,000 molecules from ChEMBL (also in the Github repository) and evaluated the upper bound of the tree width for each molecule. In addition, I evaluated a few well known topological descriptors for comparison purposes. The four plots summarize the results.
The calculation is certainly very fast, and, surprisingly, doesn’t seem to correlate with molecular size. Apparently, some relatively small molecules take the longest time – but even those are very fast. Unfortunately, the descriptor is indeed degenerate as shown in the top right – a given tree width value shows up for both small and large molecules (the R^2 between number of bonds and tree width is 0.03). The histogram in the lower left indicates that 60% of the molecules had the same value of tree width. In other words, the tree width does not really differentiate bewteen molecular structures (in terms of size or complexity). In contrast, if we consider the Weiner Path index, which has been used extensively in QSAR models, primarily as a measure of branching, we see that it exhibits a much closer relation with molecular size. Other topological measures focusing more specifically on structural complexity such as fragment complexity show similar correlations with molecular size (and with each other).
So in conclusion, I don’t think the tree width is a useful descriptor for modeling purposes.
Version 1.0.5 of the CDK descriptor calculator is now available. This version updates the command line batch mode to allow one to calculate a specific set of descriptors (as opposed to all or say, topological). The selected descriptors are specified using an XML file, which can be generated in the GUI mode – fire up the calculator in GUI mode, check the selected descriptors and then save the selection. You can then specify the selection file via the -s option.
My previous post did a quick comparison of the GSK anti-malarial screening dataset with a virtual library of Ugi products. That comparison was based on the PubChem fingerprints and indicated a broad degree of overlap. I was also interested in looking at the overlap in other feature spaces. The simplest way to do this is to evaluate a set of descriptors and then perform a principal components analysis. We can then plot the first two principal components to get an idea of the distribution of the compounds in the defined space.
I evaluated a number of descriptors using the CDK. In a physicochemical space represented by the number of rotatable bonds, molecular weight and XlogP values, a plot of the first two principal components looks as shown on the right. Given the large number of points, the plot is more of a blob, but does highlight the fact that there is a good degree of overlap between the two datasets. On going to a BCUT space on the left, we get a different picture, stressing the greater diversity of the GSK dataset. Of course, these are arbitary descriptor spaces and not necessarily meaningful. One would probably choose a descriptor space based on the problem at hand (and also the CDK XlogP implementation probably needs some work).
I was also interested in the promiscuity of the compounds in the GSK dataset. Promiscuity is the phenomenon where a molecule shows activity in multiple assays. Promiscuous activity could be indicate that the compound is truly active in all or most of the assays (i.e., hitting multiple distinct targets), but could also indicate that the activity is artifactual (such as if it were an aggregator or flourescent compound).
This analysis is performed by looking for those GSK molecules that are in the NCGC collection (272 exact matches) and checking to see how many NCGC assays they are tested in and whether they were active or not. Rather than look at all assays in the NCGC collection, I consider a subset of approximately 1300 assays curated by a colleague. Ideally, a compound will be active in only one (or a few) of the assays it is tested in.
For simplicities sake, I just plot the number of assays a compound is tested in versus the number of them that it is active in. The plot is colored by the activity (pXC50 value in the GSK SD file) so that more potent molecules are lighter. While the bulk of these molecules do not show significant promiscuous activity, a few of them do lie at the upper range. I’ve annotated four and their structures are shown below. Compound 530674 appears to be quite promiscuous given that it is active in 46 out of 84 assays it’s been tested in at the NCGC. On the other hand, 22942 is tested in 232 assays but is activity in 78 of them. This could be considered a low ratio, and isoquinolines have been noted to be non-promiscuous. (Both of these target kinases as noted in Gamo et al).