Archive for the ‘data science’ tag
I came across a paper from Chaput et al. that describes an approach to hit selection from a virtual screen (using docking) when follow-up resources are limited (a common scenario in many academic labs). Their approach is based on using multiple docking programs. As they (and others) have pointed out, there is a wide divergence between the rankings of compounds generated using different programs. Hence the motivation for a consensus approach, based on estimating the standard deviation (SD) of scores generated by a given program and computing the intersection of compounds whose scores are greater than 2 standard deviations from the mean in each program. Based on this rule, they selected relatively few compounds (just 14 to 22, depending on the target) and confirmed at least one of them for each target. This represents less than 0.5% of their screening deck.
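To make the thresholding rule concrete, here is a minimal sketch in R. It is not the authors' code: the three program names, the random toy scores, the planted "hits" and the convention that lower scores are better are all assumptions made for illustration.

```r
## Toy docking scores for 200 compounds from three hypothetical programs.
## Assumption: lower scores are better in every program.
set.seed(1)
scores <- data.frame(progA = rnorm(200), progB = rnorm(200), progC = rnorm(200))
rownames(scores) <- paste0("mol", seq_len(nrow(scores)))
scores[1:3, ] <- -5   # plant three compounds that score well everywhere

## Flag compounds scoring more than k SDs better than the mean of one program
sd_hits <- function(x, k = 2) x < mean(x) - k * sd(x)

## Logical matrix (compounds x programs), then keep compounds flagged by all
flags <- sapply(scores, sd_hits)
consensus <- rownames(scores)[rowSums(flags) == ncol(flags)]
consensus
```

The choice of `k` is the parameter that has to be picked up front, and it decides how many compounds survive the intersection.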
However, their method is parametric – you need to select an SD threshold. I was interested in seeing whether a non-parametric, rank-based approach would allow one to retrieve a subset that included the actives identified by the authors. The method is essentially the rank product method applied to the docking scores. That is, the compounds are ranked based on their docking scores and the “ensemble rank” for a compound is the product of its ranks according to each of the four programs. In contrast to the original definition, I used the sum of the log ranks to avoid overflow issues. So the ensemble rank for the $i$‘th compound is given by
$$R_i = \sum_{j=1}^{4} \log r_{ij}$$
where $r_{ij}$ is the rank of the $i$‘th compound in the $j$‘th docking program. Compounds are then selected based on their ensemble rank. Obviously this doesn’t give you a selection per se. Instead, this allows you to select as many compounds as you want or need. Importantly, it allows you to introduce external factors (cost, synthetic feasibility, ADME properties, etc.) as additional rankings that can be included in the ensemble rank.
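As a small illustration of adding an external factor, the sketch below treats a made-up cost column as one more ranking. The compound names, program names, scores and costs are all hypothetical, and lower is assumed to be better for every column.

```r
## Hypothetical docking scores (lower = better) for five compounds, two programs
scores <- data.frame(progA = c(-9.1, -7.4, -8.2, -5.0, -6.3),
                     progB = c(-30, -42, -38, -20, -25))
rownames(scores) <- paste0("mol", 1:5)
cost <- c(40, 15, 120, 10, 300)   # hypothetical cost per compound; cheaper is better

## Rank within each program (rank 1 = best), then append the cost ranking
ranks <- apply(scores, 2, rank)
ranks <- cbind(ranks, cost = rank(cost))

## Ensemble rank: sum of log ranks, i.e. the log of the rank product
ensemble <- rowSums(log(ranks))
names(ensemble) <- rownames(scores)

## Compounds from best to worst ensemble rank
ordered <- names(sort(ensemble))
ordered
```

Here `exp(ensemble)` recovers the plain rank product, so the ordering matches the original definition; the log form just stays well behaved as more rankings are added.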
Using the docking scores for Calcineurin and Histone Binding Protein (Hbp) provided by Liliane Mouawad (though all the data really should’ve been included in the paper) I applied this method using the code below
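## Read the Calcineurin docking scores, one row per compound
## NOTE: the remaining read.table() arguments were cut off in the original; header=TRUE is an assumption
d <- read.table('http://cmib.curie.fr/sites/u759/files/document/score_vs_cn.txt',
                header=TRUE)
names(d) <- c('molid', 'Surflex', 'Glide', 'Flexx', 'GOLD')
d$GOLD <- -1*d$GOLD ## Higher GOLD scores are better, so flip the sign
## Rank compounds within each program (rank 1 = lowest score)
ranks <- apply(d[,-1], 2, rank)
## Sum of log ranks = log of the rank product
lranks <- rowSums(log(ranks))
tmp <- data.frame(molid=d[,1], ranks, lrp=lranks)
## Sort so that the best ensemble ranks come first
tmp <- tmp[order(tmp$lrp),]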
and identified the single active for Hbp at ensemble rank 8 and the three actives for Calcineurin at ranks 3, 5 and 25. Of course, if you were selecting only the top 3 you would have missed the Calcineurin hit and gotten only 1/3 of the Hbp hits. However, as the authors nicely showed, manual inspection of the binding poses is crucial to making an informed selection. The ranking is just a starting point.
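The trade-off between how far down the list you go and how many actives you recover can be sketched as follows. This is a toy illustration, not a re-analysis of the data: the ranked table and the active IDs are made up, with the actives placed at positions 3, 5 and 25.

```r
## Hypothetical ranked table: 30 compounds already sorted by ensemble rank
ranked <- data.frame(molid = paste0("mol", 1:30), stringsAsFactors = FALSE)
actives <- c("mol3", "mol5", "mol25")   # made-up active IDs

## Position of each active in the ranked list
pos <- match(actives, ranked$molid)

## Fraction of actives recovered if only the top n compounds are followed up
recall_at <- function(n) mean(pos <= n)
recall <- sapply(c(3, 10, 25), recall_at)
recall
```

With these made-up positions, a cutoff of 3 recovers one of the three actives, while a cutoff of 25 recovers all of them.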
Update: Docking scores for Calcineurin and Hbp are now available
Gamification is a hot topic and companies such as Tunedit and Kaggle are successfully hosting a variety of data mining competitions. These competitions employ data from a variety of domains such as bond trading, essay scoring and so on. Recently, both platforms have hosted a QSAR challenge (though not officially denoted as such). The most recent one is the challenge hosted at Kaggle by Boehringer Ingelheim.
While it’s good to see these competitions raise the profile of “data science” (and make some money for the winners), I must admit that they are not particularly interesting to me, as it really boils down to looking at numbers with no context (aka domain knowledge). For example, in the Kaggle & BI example, there are 1,776 descriptors that have been normalized but no indication of the chemistry or biology. One could ask whether a certain mechanism of action is known to play a role in the biology being tested, which could suggest a certain class of descriptors over another. Alternatively, one could ask whether there are a few distinct chemotypes present, thus suggesting multiple local models versus a single global model. (I suppose that the supplied descriptors may lend themselves to a clustering, but a scaffold-based approach would be much more direct and chemically intuitive).
This is not to say that such competitions are useless. On the contrary, lack of domain knowledge doesn’t preclude one from applying sophisticated statistical and machine learning methods to unannotated data and obtaining impressive results. The issue of data versus domain knowledge has been discussed in several places.
In contrast to the currently hosted challenge at Kaggle, an interesting twist would be to try and reverse engineer the structures from their descriptor values. There have been some previous discussions on reverse engineering structures from descriptor data. Obviously, we’re not going to be able to verify our results, but it would be an interesting challenge.