While at the ACS National Meeting in Philadelphia I attended a talk by David Thompson of Boehringer Ingelheim (BI), where he spoke about a recent competition BI sponsored on Kaggle, a website that hosts data mining competitions. In this instance, BI provided a dataset that contained only object identifiers, about 1700 numerical features and a binary dependent variable. The contest was open to anybody, and whoever built the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition and also look at David’s slides.
But I’m curious about the utility of such a competition. During the competition, all the contestants had access to was the numerical features. So they had no idea of the domain the data came from, which placed the onus on pure modeling ability with no need for domain knowledge. But in fact the dataset provided to them, as announced by David at the ACS, was the Hansen Ames mutagenicity dataset, characterized using a collection of 2D descriptors (continuous topological descriptors as well as binary fingerprints).
BI included some “default” models and the winning models certainly performed better (10% for the winning model). This is not surprising, as BI did not attempt to build optimized models. But then we also see that the top 5 models differed only incrementally in their log loss values. Thus any one of the top 3 or 4 models could be regarded as a winner in terms of actual predictions.
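To make the metric concrete, here is a minimal sketch of binary log loss and of a relative improvement over a baseline. The labels and probabilities are made up for illustration and are not from the competition, and it assumes that the "10%" figure refers to a relative reduction in log loss, which is not stated above. The helper names `log_loss` and `relative_improvement` are hypothetical.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def relative_improvement(baseline, candidate):
    """Fractional reduction in log loss relative to a baseline (lower loss is better)."""
    return (baseline - candidate) / baseline


# Toy labels and made-up predicted probabilities (illustration only).
y = [1, 0, 1, 1, 0]
baseline_probs = [0.60, 0.40, 0.60, 0.60, 0.40]
candidate_probs = [0.65, 0.35, 0.65, 0.65, 0.35]

baseline_ll = log_loss(y, baseline_probs)
candidate_ll = log_loss(y, candidate_probs)
gain = relative_improvement(baseline_ll, candidate_ll)

print(f"baseline={baseline_ll:.4f} candidate={candidate_ll:.4f} gain={gain:.1%}")
```

Note that in this toy case the two models make exactly the same active/inactive calls at a 0.5 threshold, even though the log loss differs. That is the sense in which a gap in log loss need not translate into different predictions.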
What I’d really like to know is whether such an approach leads to better chemistry or biology. First, it’s clear that such an approach optimizes pure predictive performance and cannot provide insight into why the model makes an active or inactive call. In many scenarios this is sufficient, but more often than not, domain-specific diagnostics are invaluable. Second, how does the relative increase in model performance lead to better decision making? Granted, the crowd-sourced, gamified approach is a nice way to eke out the last bits of predictive performance on a dataset, but does it really matter that one model performs 1% better than the next best model? The fact that the winning model was 10% better than the “default” BI model is not too informative. So a specific question I have is: was there a benefit, in terms of model performance and downstream decision making, from asking the crowd for a better model, compared to what BI had developed using (implicit or explicit) chemical knowledge?
My motivation is to try and understand whether the winning model was an incremental improvement or a significant jump, not just in terms of numerical performance, but in terms of the predicted chemistry/biology. People have been making noises about how data trumps knowledge (or rather hypotheses and models) and I believe that in some cases this can be true. But I also wonder to what extent this holds for chemical data mining.
But it’s equally important to understand what such a model is to be used for. In a virtual screening scenario, one could probably ignore interpretability and go for pure predictive performance. In such cases, for increasingly large libraries, it might make sense to have a model that is 1% better than the state of the art. (In fact, there was a very interesting talk by Nigel Duffy of Numerate, where he spoke about a closed-form, analytical expression for the hit rate in a virtual screen, which indicates that to improve the overall performance of a VS workflow, the best investment is to increase the accuracy of the predictive model. Indeed, his results seem to indicate that even incremental improvements in model accuracy lead to a decent boost to the hit rate.)
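As a rough illustration of why small accuracy gains can matter in a virtual screen, the sketch below uses the textbook Bayes' rule expression for the fraction of predicted actives that are truly active. This is not the expression from Nigel Duffy's talk, which is not reproduced here. The function name `hit_rate` and all the numbers (1% actives in the library, 80% sensitivity, 95% vs. 96% specificity) are assumptions chosen purely for illustration.

```python
def hit_rate(base_rate, sensitivity, specificity):
    """Fraction of predicted actives that are truly active (precision), via Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1.0 - specificity) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


# Hypothetical screen: 1% of the library is active, the model recovers 80% of actives.
base = hit_rate(base_rate=0.01, sensitivity=0.80, specificity=0.95)
better = hit_rate(base_rate=0.01, sensitivity=0.80, specificity=0.96)

print(f"hit rate at 95% specificity: {base:.1%}")
print(f"hit rate at 96% specificity: {better:.1%}")
```

With these made-up numbers, a one point gain in specificity moves the hit rate from roughly 13.9% to roughly 16.8%, because at low base rates the false positives dominate the list of predicted actives.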
I want to stress that I’m not claiming that BI (or any other organization involved in this type of activity) has the absolute best models and that nobody can do better. I firmly believe that however good you are at something, there’s likely to be someone better at it (after all, there are 6 billion people in the world). But I’d also like to know whether, and how well, incrementally better models hold up when put to the test of real, prospective predictions.