Archive for the ‘classification’ tag
The topic of algorithmic fairness has started receiving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.
I recently came across a paper from Adler et al. on the topic of algorithmic fairness. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, all of these approaches involve retraining the model with scrambled or randomized descriptors.
The core of Adler et al is their statement that
the information content of a feature can be estimated by trying to predict it from the remaining features.
Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.
The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.
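To make the comparison concrete, here is a minimal sketch of descriptor scrambling under the black box assumption. It is not the ε-obscurity procedure from the paper: it simply shuffles one column of the test set at a time and records the drop in accuracy, calling the model only through a prediction function. The names `scrambling_influence`, `toy_model`, `X_test` and `y_test` are hypothetical, and the toy model and data are made up for illustration.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the black-box `predict` matches the label."""
    hits = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return hits / len(y)


def scrambling_influence(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in test-set accuracy when each column is shuffled in turn.

    The model is only ever called through `predict`; it is never retrained.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            # Shuffle column j, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_scrambled = [row[:j] + [v] + row[j + 1:]
                           for row, v in zip(X, column)]
            total += baseline - accuracy(predict, X_scrambled, y)
        drops.append(total / n_repeats)
    return drops


# Toy black box (an assumption for illustration): uses feature 0 only.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0


data_rng = random.Random(1)
X_test = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_test = [toy_model(row) for row in X_test]

influence = scrambling_influence(toy_model, X_test, y_test)
# Features ordered from most to least influential.
ranking = sorted(range(len(influence)), key=lambda j: influence[j], reverse=True)
```

In this toy case the second feature is ignored by the model, so scrambling it leaves the accuracy unchanged, while scrambling the first feature degrades it substantially. Note that shuffling breaks correlations between descriptors, which is exactly the situation the obscuring approach is designed to handle more carefully.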
Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature \(X_i\) and regress it on all the other features \(X_j\ (j \neq i)\). Repeating this for all \(X_i\) would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are successfully estimated from the remaining features. This would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but would not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).
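A rough sketch of this ranking is given below. To keep it self-contained it uses a plain linear least squares fit written in pure Python, so it only captures linear relationships; a non-linear learner would have to be swapped in to test for non-linear correlations. The function names and the toy data are my own assumptions, and the RMSE is computed in-sample rather than on held-out data.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_rmse(X_others, target, ridge=1e-8):
    """In-sample RMSE of a least squares fit of `target` on `X_others`.

    A tiny ridge term is added to the normal equations for numerical stability.
    """
    rows = [[1.0] + list(r) for r in X_others]  # leading 1.0 is the intercept
    p = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    for a in range(p):
        XtX[a][a] += ridge
    Xty = [sum(r[a] * t for r, t in zip(rows, target)) for a in range(p)]
    beta = solve_linear_system(XtX, Xty)
    residuals = [t - sum(b * v for b, v in zip(beta, r))
                 for r, t in zip(rows, target)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))


def rank_by_predictability(X):
    """Regress each feature on all the others; lowest RMSE comes first."""
    n_features = len(X[0])
    rmse = []
    for i in range(n_features):
        target = [row[i] for row in X]
        others = [row[:i] + row[i + 1:] for row in X]
        rmse.append(fit_rmse(others, target))
    order = sorted(range(n_features), key=lambda i: rmse[i])
    return order, rmse


# Toy data (made up): feature 1 is nearly a multiple of feature 0,
# feature 2 is independent noise.
rng = random.Random(0)
X = []
for _ in range(300):
    a = rng.gauss(0.0, 1.0)
    X.append([a, 2.0 * a + rng.gauss(0.0, 0.05), rng.gauss(0.0, 1.0)])

order, rmse = rank_by_predictability(X)
```

Here the two correlated features get a low RMSE and the independent one ends up last in `order`. Since RMSE is expressed in the units of the feature being predicted, the features should be scaled to comparable ranges before the values are compared across features.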
Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?
While at the ACS National Meeting in Philadelphia I attended a talk by David Thompson of Boehringer Ingelheim (BI), where he spoke about a recent competition BI sponsored on Kaggle – a web site that hosts data mining competitions. In this instance, BI provided a dataset that contained only object identifiers, about 1700 numerical features and a binary dependent variable. The contest was open to anybody and whoever built the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition and also in David’s slides.
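For readers unfamiliar with the metric, binary log loss is the negative mean log-likelihood of the true labels under the predicted probabilities, so lower is better and confident wrong predictions are penalized heavily. The snippet below is a minimal sketch of the standard definition; the clipping constant `eps` and the toy labels and probabilities are my own choices, not anything taken from the competition.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss: negative mean log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Made-up example: five molecules with known labels.
labels = [1, 0, 1, 1, 0]
confident = [0.9, 0.1, 0.8, 0.7, 0.2]   # probabilities that lean the right way
uninformative = [0.5] * len(labels)     # always predicts 0.5

loss_confident = log_loss(labels, confident)
loss_uninformative = log_loss(labels, uninformative)  # equals ln(2)
```

A model that always predicts 0.5 scores ln(2), about 0.693, which is a useful baseline when judging how much a given difference in log loss really means.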
But I’m curious about the utility of such a competition. During the competition, all the contestants had access to were the numerical features. So the contestants had no idea of the domain from which the data came – placing the onus on pure modeling ability, with no need for domain knowledge. But in fact the dataset provided to them, as announced by David at the ACS, was the Hansen Ames mutagenicity dataset characterized using a collection of 2D descriptors (continuous topological descriptors as well as binary fingerprints).
BI included some “default” models and the winning models certainly performed better (10% for the winning model). This is not surprising, as BI did not attempt to build optimized models. But then we also see that the top 5 models differed only incrementally in their log loss values. Thus any one of the top 3 or 4 models could be regarded as a winner in terms of actual predictions.
What I’d really like to know is whether such an approach leads to better chemistry or biology. First, it’s clear that such an approach leads to the optimization of pure predictive performance and cannot provide insight into why the model makes an active or inactive call. In many scenarios this is sufficient, but more often than not, domain specific diagnostics are invaluable. Second, how does the relative increase in model performance lead to better decision making? Granted, the crowd-sourced, gamified approach is a nice way to eke out the last bits of predictive performance on a dataset – but does it really matter that one model performs 1% better than the next best model? The fact that the winning model was 10% better than the “default” BI model is not too informative. So a specific question I have is, was there a benefit, in terms of model performance and downstream decision making, from asking the crowd for a better model, compared to what BI had developed using (implicit or explicit) chemical knowledge?
My motivation is to try and understand whether the winning model was an incremental improvement or whether it was a significant jump, not just in terms of numerical performance, but in terms of the predicted chemistry/biology. People have been making noises about how data trumps knowledge (or rather hypotheses and models) and I believe that in some cases this can be true. But I also wonder to what extent this holds for chemical data mining.
But it’s equally important to understand what such a model is to be used for. In a virtual screening scenario, one could probably ignore interpretability and go for pure predictive performance. In such cases, for increasingly large libraries, it might make sense for one to have a model that is 1% better than the state of the art. (In fact, there was a very interesting talk by Nigel Duffy of Numerate, where he spoke about a closed form, analytical expression for the hit rate in a virtual screen, which indicates that for improvements in the overall performance of a VS workflow, the best investment is to increase the accuracy of the predictive model. Indeed, his results seem to indicate that even incremental improvements in model accuracy lead to a decent boost to the hit rate).
I want to stress that I’m not claiming that BI (or any other organization involved in this type of activity) has the absolute best models and that nobody can do better. I firmly believe that however good you are at something, there’s likely to be someone better at it (after all, there are 6 billion people in the world). But I’d also like to know whether, and how well, incrementally better models hold up when put to the test of real, prospective predictions.
I’ve just uploaded a new version of the fingerprint package (v3.3) to CRAN that implements some ideas described in Nisius and Bajorath. First, the balance method generates “balanced code” fingerprints: given an input fingerprint of N bits, it returns a new fingerprint of 2N bits, such that the bit density is exactly 50%. Second, bit.importance is a method to evaluate the importance of each bit in a fingerprint, in terms of the Kullback-Leibler divergence between a collection of actives and background molecules. In other words, the method ranks the bits in terms of their ability to discriminate between the actives and the background molecules.
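To illustrate the two ideas outside of R, here is a small Python sketch. It is not the package's code. It assumes that the balanced code is formed by appending the bitwise complement of the fingerprint (one construction that guarantees 50% density), and that each bit is scored by the Kullback-Leibler divergence between two Bernoulli distributions whose frequencies are Laplace-smoothed. The smoothing scheme, the list-of-0/1 representation and the toy fingerprints are my assumptions; the Python functions merely borrow the names of the R methods.

```python
import math


def balance(bits):
    """Return a 2N-bit fingerprint: the input followed by its complement."""
    return list(bits) + [1 - b for b in bits]


def bit_frequencies(fingerprints, pseudo=1.0):
    """Smoothed frequency with which each bit is set across a collection."""
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [(sum(fp[k] for fp in fingerprints) + pseudo) / (n + 2.0 * pseudo)
            for k in range(n_bits)]


def bit_importance(actives, background):
    """Per-bit KL divergence of the actives from the background."""
    p = bit_frequencies(actives)
    q = bit_frequencies(background)
    return [pk * math.log(pk / qk)
            + (1.0 - pk) * math.log((1.0 - pk) / (1.0 - qk))
            for pk, qk in zip(p, q)]


# Balanced code: 5 input bits become 10 bits, exactly half of them set.
balanced = balance([1, 0, 0, 1, 1])

# Made-up 4-bit fingerprints: bit 0 is set in every active but is
# rare in the background.
actives = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0]]
background = [[0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0]]

scores = bit_importance(actives, background)
top_bit = max(range(len(scores)), key=lambda k: scores[k])
```

In the toy data the highest score goes to bit 0, the bit whose frequency differs most between the actives and the background. The smoothing keeps the logarithms finite when a bit is always or never set in one of the collections.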