So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘Literature’ Category

vSDC, Rank Products and DUD-E

with 4 comments

This post is a follow-up to my previous discussion on a paper by Chaput et al. The gist of that paper was that in a virtual screening scenario where a small number of hits are to be selected for followup, one could use an ensemble of docking methods, identify compounds whose scores were beyond 2SD of the mean for each method and take the intersection. My post suggested that a non-parametric approach (rank products, RP) performed similarly to the parametric approach of Chaput et al on the two targets they screened.

The authors also performed a benchmark comparison of their consensus method (vSDC) versus the individual docking methods for 102 DUD-E targets. I was able to obtain the individual docking scores (Glide, Surflex, FlexX and GOLD) for each of the targets, with the aim of applying the rank product method described previously.

In short, I reproduced Figure 6A (excluding the curve for vSDC). In
th0this figure, \(n_{test}\) is the number of compounds selected (from the ranked list, either by individual docking scores or by the rank product) and \(T_{h>0}\) is the percentage of targets for which the \(n_{test}\) selected compounds included one or more actives. Code is available here, but you’ll need to get in touch with the authors for the DUD-E docking scores.

As shown alongside, the RP method (as expected) outperforms the individual docking methods. And visual comparison with the original figure suggests that it also outperforms vSDC, especially at lower values of \(n_{test}\). While I wouldn’t regard the better performance of RP compared to vSDC as a huge jump, the absence of a threshold certainly works in its favor.

One could certainly explore ranking approaches in more depth. As suggested by Abhik Seal, Borda or Condorcet methods could be examined (though the small number of docking methods, a.k.a., voter, could be problematic).

UPDATE: After a clarification from Liliane Mouawad it turns out there was a mistake in the ranking of the Surflex docking scores. Correcting that bug fixes my reproduction of Figure 6A so that the curves for individual docking methods match the original. But more interestingly, the performance of RP is now clearly better than every individual method and the vSDC method as well, at all values of \(n_{test}\)

Written by Rajarshi Guha

February 13th, 2016 at 7:25 pm

Hit Selection When You’re Strapped for Cash

with one comment

I came across a paper from Chaput et al that describes an approach to hit selection from a virtual screen (using docking), when follow-up resources are limited (a common scenario in many academic labs). Their approach is based on using multiple docking programs. As they (and others) have pointed out, there is a wide divergence between the rankings of compounds generated using different programs. Hence the motivation for a consensus approach, based on the estimating the standard deviation (SD) of scores generated by a given program and computing the intersection of compounds whose scores are greater than 2 standard deviations from the mean, in each program. Based on this rule, they selected relatively few compounds – just 14 to 22, depending on the target and confirmed at least one of them for each target. This represents less than 0.5% of their screening deck.

However, their method is parametric – you need to select a SD threshold. I was interested in seeing whether a non-parametric, ranking based approach would allow one to retrieve a subset that included the actives identified by the authors. The method is essentially the rank product method applied to the docking scores. That is, the compounds are ranked based on their docking scores and the “ensemble rank” for a compound is the product of its ranks according to each of the four programs. In contrast to the original definition, I used a sum log rank to avoid overflow issues. So the ensemble rank for the \(i\)’th compound is given by

\(R_i = \sum_{j=1}^{4} \log r_{ij}\)

where \(r_{ij}\) is the rank of the \(i\)’th compound in the \(j\)’th docking program. Compounds are then selected based on their ensemble rank. Obviously this doesn’t give you a selection per se. Instead, this allows you to select as many compounds as you want or need. Importantly, it allows you to introduce external factors (cost, synthetic feasibility, ADME properties, etc.) as additional rankings that can be included in the ensemble rank.

Using the docking scores for Calcineurin and Histone Binding Protein (Hbp) provided by Liliane Mouawad (though all the data really should’ve been included in the paper) I applied this method using the code below

d <- read.table('',
                header=TRUE, comment='')
names(d) <- c('molid', 'Surflex', 'Glide', 'Flexx', 'GOLD')
d$GOLD <- -1*d$GOLD ## Since higher scores are better
ranks <- apply(d[,-1], 2, rank)
lranks <- rowSums(log(ranks))
tmp <- data.frame(molid=d[,1], ranks, lrp=rp)
tmp <- tmp[order(tmp$lrp),]
which(str_detect(tmp$molid, 'ACTIVE'))

and identified the single active for Hbp at ensemble rank 8 and the three actives for Calcineurin at ranks 3, 5 and 25. Of course, if you were selecting only the top 3 you would’ve missed the Calcineurin hit and only have gotten 1/3 of the HBP hits. However, as the authors nicely showed, manual inspection of the binding poses is crucial to making an informed selection. The ranking is just a starting point.

Update: Docking scores for Calcineurin and Hbp are now available

Written by Rajarshi Guha

February 5th, 2016 at 1:36 am

Cryptography & Chemical Structure Search

without comments

Encryption of chemical information has not been a very common topic in cheminformatics. There was an ACS symposium in 2005 (summary) that had a number of presentations on the topic of “safe exchange” of chemical information – i.e., exchanging information on chemical structures without sharing the structures themselves. The common thread running through many presentations was to identify representations (a.k.a, descriptors) that can be used for useful computation (e.g., regression or classification models or similarity searches) but do not allow one to (easily) regenerate the structure. Examples include the use of PASS descriptors and various topological indices. Non-descriptor based approaches included, surrogate data (that is structures of related molecules with similar properties) and most recently, scaffold networks. Also, Masek et al, JCIM, 2008 described a procedure to assess the risk of revealing structure information given a set of descriptors.

As indicated by Tetko et al, descriptor based approaches are liable to dictionary based attacks. Theoretically if one fully enumerates all possible molecules and computes the descriptors it would be trivial to obtain the structure of an obfuscated molecule. While this is not currently practical, Masek et al have already shown that an evolutionary algorithm can reconstruct the exact (or closely related) structure from BCUT descriptors in a reasonable time frame and Wong & Burkowski, JCheminf, 2009 described a kernel approach to generating structures from a set of descriptors (though they were considering the inverse QSAR problem rather than chemical privacy). Uptil now I wasn’t aware of approaches that were truly one way – impossible to regenerate the structure from the descriptors, yet also perform useful computations.

Which brings me to an interesting paper by Shimuzu et al which describes a cryptographic approach to chemical structure search, based on homomorphic encryption. A homomorphic encryption scheme allows one to perform computations on the encrypted (usually based on PKI) input leading to an encrypted result, which when decrypted gives the same result as if one had performed the computation on the clear (i.e., unecnrypted) input. Now, a “computation” can involve a variety of operations – addition, multiplication etc. Till recently, most homomorphic schemes were restricted to one or a few operations (and so are termed partially homomorphic). It was only in 2009 that a practical proposal for a fully homomorphic (i.e., supporting arbitrary computations) cryptosystem was described. See this excellent blog post for more details on homomorphic cryptosystems.

The work by Shimuzu et al addresses the specific case of a user trying to identify molecules from a database that are similar to a query structure. They consider a simplified situation where the user is only interested in the count of molecules above a similarity threshold. Two constraints are:

  1. Ensure that the database does not know the actual query structure
  2. The user should not gain information about the database contents (except for number of similar molecules)

Their scheme is based on a additive homomorphic system (i.e., the only operation supported on the encrypted data is addition) and employs binary fingerprints and the Tversky similarity metric (which can be reduced to Tanimoto if required). I note that they used 166-bit MACCS keys. Since it’s small and each bit position is known it seems that some information could leak out of the encrypted fingerprint or be subject to a dictionary attack. I’d have expected that using a larger hashed fingerprint would have helped improve the security. (Though I suspect that the encryption of the query fingerprint alleviates this issue). Another interesting feature, designed to prevent information about the database leaking back to the user is the use of “dummies” – random, encrypted (by the users public key) integers that are mixed with the true (encrypted) query result. Their design allows the user to determine the sign of the query result (which indicates whether the database molecule is similar to the query, above the specified threshold), but does not let them get the actual similarity score. They show that as the number of dummies is increased, the chances of database information leaking out tends towards zero.

Of course, one could argue that the limited usage of proprietary chemical information (in terms of people who have it and people who can make use of it) means that the efforts put in to obfuscation, cryptography etc. could simply be replaced by legal contracts. Certainly, a simple way to address the scenario discussed here (and noted by the authors) is to download the remote database locally. of course this is not feasible if the remote database is meant to stay private (e.g., a competitors structure database).

But nonetheless, methods that rigorously guarantee privacy of chemical information are interesting from an algorithmic standpoint. Even though Shimuzu et al described a very simplistic situation (though the more realistic scenario where the similar database molecules are returned would obviously negate constraint 2 above), it looks like a step forward in terms of applying formal cryptanalysis to chemical problems and supporting truly safe exchange of chemical information.

Written by Rajarshi Guha

January 5th, 2016 at 3:17 am

A Model Building IDE?

with 3 comments

Recently I came across a NIPS2015 paper from Vartak et al that describes a system (APIs + visual frontend) to support the iterative model building process. The problem they are addressing is common one in most machine learning settings – building multiple models (different type) using various features and identifying one or more optimal models to take into production. As they correctly point out, most tools such as scikit-learn, SparkML, etc. focus on providing methods and interfaces to build a single model – it’s up to the user to manage the multiple models, keep track of their performance metrics.


My first reaction was, “why?”. Part of this stems from my use of the R environment which allows me to build up a infrastructure for building multiple models (e.g., caret, e1701), storing them (list of model objects, RData binary files or even pmml) and subsequent comparison and summarization. Naturally, this is specific to me (so not really a general solution) and essentially a series of R commands – I don’t have the ability to monitor model building progress and simultaneously inspect models that have been built.

In that sense, the authors statement,

For the most part, exploration of models takes place sequentially. Data scientists must write (repeated) imperative code to explore models one after another.

is correct (though maybe, said data scientists should improve their programming skills?). So a system that allows one to organize an exploration of model space could be useful. However, other statements such as

  • Without a history of previously trained models, each model must be trained from scratch, wasting computation and increasing response times.
  • In the absence of history, comparison between models becomes challenging and requires the data scientist to write special-purpose code.

seem to be relevant only when you are not already working in an environment that supports model objects and their associated infrastructure. In my workflow, the list of model object represents my history!

Their proposed system called SHERLOCK is built on top of SparkML and provides an API to model building, a database for model storage and a graphical interface (see above) to all of this. While I’m not convinced of some of the design goals (e.g., training variations of models based on previously trained models (e.g., with different feature sets) – wouldn’t you need to retrain the model from scratch if you choose a new feature set?), it’ll be interesting to see how this develops. Certainly, the UI will be key – since it’s pretty easy to generate a multitude of models with a multitude of features, but in the end a human needs to make sense of it.

On a separate note, this sounds like something RStudio could take a stab at.

Written by Rajarshi Guha

December 7th, 2015 at 3:09 am

Summarizing Collections of Curves

without comments


I was browsing live notes from the recent IEEE conference on visualization and came across a paper about functional boxplots. The idea is an extension of the boxplot visualization (shown alongside), to a set of functions. Intuitively, one can think of a functional box plot as specific envelopes for a set of functions. The construction of this plot is based on the notion of band depth (see the more general concept of data depth) which is a measure of how far a given function is from the collection of functions. As described in Sun & Genton the band depth for a given function can be computed by randomly selecting \(J\) functions and identifying wether the given function is contained within the minimum and maximum of the \(J \) functions. Repeating this multiple times, the fraction of times that the given function is fully contained within the \(J\) random functions gives the band depth, \(BD_j\). This is then used to order the functions, allowing one to compute a 50% band, analogous to the IQR in a traditional boxplot. There are more details (choice of \(J\), partial bounding, etc.) described in the papers and links above.

drc-bpMy interest in this approach was piqued since one way of summarizing a dose response screen, or comparing dose response data across multiple conditions is to generate a box plot of a single curve fit parameter – say, \(\log IC_{50} \). But if we wanted to consider the curves themselves, we have a few options. We could simply plot all of them, using translucency to avoid a blob. But this doesn’t scale visually. Another option, on the left, is to draw a series of box plots, one for each dose, and then optionally join the median of each boxplot giving a “median curve”. While these vary in their degree of utility, the idea of summarizing the distribution of a set of curves, and being able to compare these distributions is attractive. Functional box plots look like a way to do this. (A cool thing about functional boxplots is that they can be extended to multivariate functions such as surfaces and so on. See Mirzargar et al for examples)

Computing \(BD_j\) can be time consuming if the number of curves is large or \(J\) is large. Lopez-Pintado & Jornsten suggest a simple optimization to speed up this step, and for the special case of \(J = 2\), Sun et al proposed a ranking based procedure that scales to thousands of curves. The latter is implemented in the fda package for R which also generates the final functional box plots.

As an example I considered 6 cell proliferation assays run in dose response, each one running the same set of compounds, but under different growth conditions. For each assay I only considered good quality curves (giving from 349 to 602 curves). The first plot compares the actives identified in the different growth conditions using the \(\log IC_{50}\), and indicates a statistically significant increase in potency in the last three conditions compared to the first three.


In contrast, the functional box plots for the 6 assays, suggest a somewhat different picture (% Response = 100 corresponds to no cell kill and 0 corresponds to full cell kill).


The red dashed curves correspond to outliers and the blue lines correspond to the ‘maximum’ and ‘minimum’ curves (analogous to the whiskers of the traditional boxplot). Importantly, these are not measured curves, but instead correspond to the dose-wise maximum (and minimum) of the real curves. The pink region represents 50% of the curves and the black line represents the (virtual) median curve. In each case the X-axis corresponds to dose (unlabeled to save space). Personally, I think this visualization is a little cleaner than the dose-wise box plot shown above.

The mess of red lines in the plot 1 suggest an issue with the assay itself. While the other plots do show differences, it’s not clear what one can conclude from this. For example, in the plot for 4, the dip on the left hand side (i.e., low dose) could suggest that there is a degree of cytotoxicity, which is comparatively less in 3, 5 and 6. Interestingly none of the median curves are really sigmoidal, suggesting that the distribution of dose responses has substantial variance.

Written by Rajarshi Guha

November 30th, 2014 at 3:02 pm