So much to do, so little time

Trying to squeeze sense out of chemical data

Competitive Predictive Modeling – How Useful is it?

While at the ACS National Meeting in Philadelphia I attended a talk by David Thompson of Boehringer Ingelheim (BI), where he spoke about a recent competition BI sponsored on Kaggle – a web site that hosts data mining competitions. In this instance, BI provided a dataset that contained only object identifiers, about 1700 numerical features and a binary dependent variable. The contest was open to anybody, and whoever built the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition and also see David’s slides.
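For readers who have not met the metric before, binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities, so lower is better and confident wrong calls are punished heavily. The snippet below is only a minimal sketch of the standard definition: the function name and the clipping constant `eps` are my own choices, not Kaggle's actual scoring code.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss (lower is better).

    y_true: iterable of 0/1 labels
    p_pred: iterable of predicted probabilities for class 1
    eps:    clipping constant (an assumption) so log(0) is never evaluated
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A model that says 0.5 for everything scores ln 2 (about 0.693), which is a handy baseline to keep in mind when looking at leaderboard values.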

But I’m curious about the utility of such a competition. During the competition, all the contestants had access to were the numerical features. So they had no idea of the domain the data came from – placing the onus on pure modeling ability, with no need for domain knowledge. In fact the dataset provided to them, as announced by David at the ACS, was the Hansen Ames mutagenicity dataset, characterized using a collection of 2D descriptors (continuous topological descriptors as well as binary fingerprints).

BI included some “default” models and the winning models certainly performed better (by 10% for the winning model). This is not surprising, as BI did not attempt to build optimized models. But then we also see that the top 5 models differed only incrementally in their log loss values. Thus any one of the top 3 or 4 models could be regarded as a winner in terms of actual predictions.
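One way to check whether two leaderboard entries are really distinguishable is a paired bootstrap on the per-compound log loss differences. This is a sketch of a generic technique, not something BI or Kaggle reported doing; the function names, the 2000 resamples and the 95% percentile interval are all assumptions on my part.

```python
import math
import random

def per_sample_loss(y, p, eps=1e-15):
    """Log loss contribution of a single prediction."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bootstrap_loss_difference(y_true, p_a, p_b, n_boot=2000, seed=0):
    """95% percentile interval for (mean log loss of A) - (mean log loss of B).

    Both models are scored on the same test compounds, so we resample
    the paired per-compound differences with replacement.
    """
    rng = random.Random(seed)
    diffs = [per_sample_loss(y, a) - per_sample_loss(y, b)
             for y, a, b in zip(y_true, p_a, p_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return lower, upper
```

If the interval returned for two models straddles zero, the test set gives no real basis for ranking one above the other, whatever the leaderboard says.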

What I’d really like to know is whether such an approach leads to better chemistry or biology. First, it’s clear that such an approach optimizes pure predictive performance and cannot provide insight into why the model makes an active or inactive call. In many scenarios this is sufficient, but more often than not, domain specific diagnostics are invaluable. Second, how does the relative increase in model performance lead to better decision making? Granted, the crowd-sourced, gamified approach is a nice way to eke out the last bits of predictive performance on a dataset – but does it really matter that one model performs 1% better than the next best model? The fact that the winning model was 10% better than the “default” BI model is not too informative. So a specific question I have is: was there a benefit, in terms of model performance and downstream decision making, from asking the crowd for a better model, compared to what BI had developed using (implicit or explicit) chemical knowledge?

My motivation is to try and understand whether the winning model was an incremental improvement or whether it was a significant jump, not just in terms of numerical performance, but in terms of the predicted chemistry/biology. People have been making noises about how data trumps knowledge (or rather hypotheses and models) and I believe that in some cases this can be true. But I also wonder to what extent this holds for chemical data mining.

But it’s equally important to understand what such a model is to be used for. In a virtual screening scenario, one could probably ignore interpretability and go for pure predictive performance. In such cases, for increasingly large libraries, it might make sense to have a model that is 1% better than the state of the art. (In fact, there was a very interesting talk by Nigel Duffy of Numerate, where he spoke about a closed form, analytical expression for the hit rate in a virtual screen, which indicates that for improvements in the overall performance of a VS workflow, the best investment is to increase the accuracy of the predictive model. Indeed, his results seem to indicate that even incremental improvements in model accuracy lead to a decent boost to the hit rate.)

I want to stress that I’m not claiming that BI (or any other organization involved in this type of activity) has the absolute best models and that nobody can do better. I firmly believe that however good you are at something, there’s likely to be someone better at it (after all, there are 6 billion people in the world). But I’d also like to know how incrementally better models fare when put to the test of real, prospective predictions.

Written by Rajarshi Guha

August 22nd, 2012 at 9:02 pm

5 Responses to 'Competitive Predictive Modeling – How Useful is it?'

  1. I fully agree that a <1% improvement between the top-ranking models is unlikely to be significant. There is a simple correlation between the modeling error and the amount of data. We found in the past (unpublished) that below about 50 training objects, the error rate goes up very quickly and is above 5 percentage points. That means that any models within that range are effectively not statistically different.

    Of course, the exact function of this percentage is dependent on very many factors, and one really obvious one is the diversity of your data set.

    In my own presentation on semantic pipelines to molecular properties I stressed the validation, and very much like the idea of measuring the prediction error when a model is used as part of a larger/different analysis. That aligns really well with my point that modeling results must be linked/applied to other data sets.

    Egon Willighagen

    23 Aug 12 at 3:01 pm

  2. Thanks Egon. I would have thought the same (and it probably is true for smaller datasets). But the presentation from Nigel Duffy was a little surprising. It’s probably not unexpected that a more accurate model leads to better results, but his results suggest that even incremental improvements are worth it.

    Of course, it’s not clear how incremental a 1e-5 difference in log loss actually is :)

    Rajarshi

    23 Aug 12 at 4:06 pm

  3. (On behalf of David Thompson)

    Thanks for coming along on Sunday, it was nice to finally meet you in
    person! I do have a couple of points to add – both of which I
    mentioned but may have gotten lost as I whipped through the material.

    Firstly, the most important thing you should know about this exercise,
    and which is captured in the first bullet of slide 6 (titled – ‘What
    you should know about this exercise’), is that we wanted to
    investigate the utility of the process. Everything else is a
    nice-to-have.

    The second point, but clearly related, is that in < 3 months, with no
    domain knowledge, models of a high quality were generated as part of
    the Kaggle process we were investigating. This is pretty clearly
    demonstrated in the slides. You can also reproduce those models if you
    want – one from the top of the private leaderboard is on GitHub if
    you’re interested. Also, as I mentioned, I’m happy to share descriptor
    lists if you’d like to run prospective calculations.

    To reiterate, my focus was the process, if you’d like to structure a
    second challenge to look at your larger question, then I will gladly
    vouch that Kaggle is a very efficient avenue to explore.

    Rajarshi Guha

    23 Aug 12 at 5:19 pm

  4. Yeah, I’d love to see Nigel’s slides… the question is, however, what is better… if you cannot distinguish two models, how do you know which one is better? That is the problem… when we know the true answer, it’s easy to decide which model was better… but that is not your typical modeling problem. What does Nigel say about application domains, etc? Did you publish these, or related work yet?

    Egon Willighagen

    23 Aug 12 at 6:31 pm

  5. Nigel’s talks were in COMP, so I don’t know if they are collecting slides or whether he has made them available.

    Rajarshi Guha

    23 Aug 12 at 7:27 pm
