I’d Rather Be … Reverse Engineering

Gamification is a hot topic and companies such as Tunedit and Kaggle are successfully hosting a variety of data mining competitions. These competitions employ data from a variety of domains such as bond trading, essay scoring and so on. Recently, both platforms have hosted a QSAR challenge (though not officially denoted as such). The most recent one is the challenge hosted at Kaggle by Boehringer Ingelheim.

While it’s good to see these competitions raise the profile of “data science” (and make some money for the winners), I must admit that they are not particularly interesting to me, as they really boil down to looking at numbers with no context (aka domain knowledge). For example, in the Kaggle & BI example, there are 1,776 descriptors that have been normalized but no indication of the chemistry or biology. One could ask whether a certain mechanism of action is known to play a role in the biology being tested, which could suggest a certain class of descriptors over another. Alternatively, one could ask whether there are a few distinct chemotypes present, thus suggesting multiple local models versus a single global model. (I suppose that the supplied descriptors may lend themselves to a clustering, but a scaffold-based approach would be much more direct and chemically intuitive.)
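To make the clustering remark concrete, here is a minimal sketch of what grouping compounds by their descriptor values might look like. Everything in it is an assumption for illustration: the data is synthetic (two well-separated groups of 20-dimensional rows standing in for two chemotypes), it is not the Kaggle & BI dataset, and the helper names (`farthest_point_init`, `kmeans`) are hypothetical. It uses plain k-means with a farthest-point initialization, which is only one of many ways one could cluster descriptors.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two descriptor rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(rows, k):
    """Start from the first row, then repeatedly add the row that is
    farthest from every centroid chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(distance(r, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def mean_row(rows):
    """Column-wise mean of a non-empty list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def kmeans(rows, k, n_iter=100):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_point_init(rows, k)
    labels = []
    for _ in range(n_iter):
        # Assign each row to its nearest centroid.
        labels = [min(range(k), key=lambda j: distance(r, centroids[j])) for r in rows]
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            updated.append(mean_row(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Synthetic stand-in for a normalized descriptor matrix (assumption):
# 50 rows around 0 and 50 rows around 10, each with 20 "descriptors".
rng = random.Random(0)
group_a = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(50)]
group_b = [[rng.gauss(10.0, 1.0) for _ in range(20)] for _ in range(50)]
rows = group_a + group_b

labels, centroids = kmeans(rows, k=2)
print(labels.count(0), labels.count(1))  # cluster sizes
```

If the clusters found this way were to line up with distinct chemotypes, one could fit a separate local model per cluster. Without the structures, though, there is no way to check that correspondence, which is exactly the limitation described above.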

This is not to say that such competitions are useless. On the contrary, lack of domain knowledge doesn’t preclude one from applying sophisticated statistical and machine learning methods to unannotated data and obtaining impressive results. The issue of data versus domain knowledge has been discussed in several places.

In contrast to the currently hosted challenge at Kaggle, an interesting twist would be to try and reverse engineer the structures from their descriptor values. There have been some previous discussions on reverse engineering structures from descriptor data. Obviously, we’re not going to be able to verify our results, but it would be an interesting challenge.

3 thoughts on “I’d Rather Be … Reverse Engineering”

  1. Reverse engineering structures from descriptor data kind of requires the software to be freely available for people to use. Because it becomes a daunting job if you had to do it with other tools…

  2. 96well says:

    You may also find this article from a very different domain background (endocrinology) interesting. It offers a sort of proof of concept that reverse engineering structures from biological data is possible. Instead of molecular descriptors, a few ‘biological descriptors’ are shown to be dependent on the drug structure.

    Rando et al., 2010 Molecular Endocrinology vol. 24 no. 4 735-744

  3. In my opinion, they are gathering machine learning approaches for a particular problem.
