# So much to do, so little time

Trying to squeeze sense out of chemical data

## Hit Selection When You’re Strapped for Cash

I came across a paper from Chaput et al that describes an approach to hit selection from a virtual screen (using docking), when follow-up resources are limited (a common scenario in many academic labs). Their approach is based on using multiple docking programs. As they (and others) have pointed out, there is a wide divergence between the rankings of compounds generated using different programs. Hence the motivation for a consensus approach, based on the estimating the standard deviation (SD) of scores generated by a given program and computing the intersection of compounds whose scores are greater than 2 standard deviations from the mean, in each program. Based on this rule, they selected relatively few compounds – just 14 to 22, depending on the target and confirmed at least one of them for each target. This represents less than 0.5% of their screening deck.

However, their method is parametric – you need to select a SD threshold. I was interested in seeing whether a non-parametric, ranking based approach would allow one to retrieve a subset that included the actives identified by the authors. The method is essentially the rank product method applied to the docking scores. That is, the compounds are ranked based on their docking scores and the “ensemble rank” for a compound is the product of its ranks according to each of the four programs. In contrast to the original definition, I used a sum log rank to avoid overflow issues. So the ensemble rank for the $i$‘th compound is given by

$R_i = \sum_{j=1}^{4} \log r_{ij}$

where $r_{ij}$ is the rank of the $i$‘th compound in the $j$‘th docking program. Compounds are then selected based on their ensemble rank. Obviously this doesn’t give you a selection per se. Instead, this allows you to select as many compounds as you want or need. Importantly, it allows you to introduce external factors (cost, synthetic feasibility, ADME properties, etc.) as additional rankings that can be included in the ensemble rank.

Using the docking scores for Calcineurin and Histone Binding Protein (Hbp) provided by Liliane Mouawad (though all the data really should’ve been included in the paper) I applied this method using the code below

 12345678910 library(stringr) d <- read.table('http://cmib.curie.fr/sites/u759/files/document/score_vs_cn.txt',                 header=TRUE, comment='') names(d) <- c('molid', 'Surflex', 'Glide', 'Flexx', 'GOLD') d$GOLD <- -1*d$GOLD ## Since higher scores are better ranks <- apply(d[,-1], 2, rank) lranks <- rowSums(log(ranks)) tmp <- data.frame(molid=d[,1], ranks, lrp=rp) tmp <- tmp[order(tmp$lrp),] which(str_detect(tmp$molid, 'ACTIVE'))

and identified the single active for Hbp at ensemble rank 8 and the three actives for Calcineurin at ranks 3, 5 and 25. Of course, if you were selecting only the top 3 you would’ve missed the Calcineurin hit and only have gotten 1/3 of the HBP hits. However, as the authors nicely showed, manual inspection of the binding poses is crucial to making an informed selection. The ranking is just a starting point.

Update: Docking scores for Calcineurin and Hbp are now available

Written by Rajarshi Guha

February 5th, 2016 at 1:36 am

## Cryptography & Chemical Structure Search

Encryption of chemical information has not been a very common topic in cheminformatics. There was an ACS symposium in 2005 (summary) that had a number of presentations on the topic of “safe exchange” of chemical information – i.e., exchanging information on chemical structures without sharing the structures themselves. The common thread running through many presentations was to identify representations (a.k.a, descriptors) that can be used for useful computation (e.g., regression or classification models or similarity searches) but do not allow one to (easily) regenerate the structure. Examples include the use of PASS descriptors and various topological indices. Non-descriptor based approaches included, surrogate data (that is structures of related molecules with similar properties) and most recently, scaffold networks. Also, Masek et al, JCIM, 2008 described a procedure to assess the risk of revealing structure information given a set of descriptors.

As indicated by Tetko et al, descriptor based approaches are liable to dictionary based attacks. Theoretically if one fully enumerates all possible molecules and computes the descriptors it would be trivial to obtain the structure of an obfuscated molecule. While this is not currently practical, Masek et al have already shown that an evolutionary algorithm can reconstruct the exact (or closely related) structure from BCUT descriptors in a reasonable time frame and Wong & Burkowski, JCheminf, 2009 described a kernel approach to generating structures from a set of descriptors (though they were considering the inverse QSAR problem rather than chemical privacy). Uptil now I wasn’t aware of approaches that were truly one way – impossible to regenerate the structure from the descriptors, yet also perform useful computations.

Which brings me to an interesting paper by Shimuzu et al which describes a cryptographic approach to chemical structure search, based on homomorphic encryption. A homomorphic encryption scheme allows one to perform computations on the encrypted (usually based on PKI) input leading to an encrypted result, which when decrypted gives the same result as if one had performed the computation on the clear (i.e., unecnrypted) input. Now, a “computation” can involve a variety of operations – addition, multiplication etc. Till recently, most homomorphic schemes were restricted to one or a few operations (and so are termed partially homomorphic). It was only in 2009 that a practical proposal for a fully homomorphic (i.e., supporting arbitrary computations) cryptosystem was described. See this excellent blog post for more details on homomorphic cryptosystems.

The work by Shimuzu et al addresses the specific case of a user trying to identify molecules from a database that are similar to a query structure. They consider a simplified situation where the user is only interested in the count of molecules above a similarity threshold. Two constraints are:

1. Ensure that the database does not know the actual query structure
2. The user should not gain information about the database contents (except for number of similar molecules)

Their scheme is based on a additive homomorphic system (i.e., the only operation supported on the encrypted data is addition) and employs binary fingerprints and the Tversky similarity metric (which can be reduced to Tanimoto if required). I note that they used 166-bit MACCS keys. Since it’s small and each bit position is known it seems that some information could leak out of the encrypted fingerprint or be subject to a dictionary attack. I’d have expected that using a larger hashed fingerprint would have helped improve the security. (Though I suspect that the encryption of the query fingerprint alleviates this issue). Another interesting feature, designed to prevent information about the database leaking back to the user is the use of “dummies” – random, encrypted (by the users public key) integers that are mixed with the true (encrypted) query result. Their design allows the user to determine the sign of the query result (which indicates whether the database molecule is similar to the query, above the specified threshold), but does not let them get the actual similarity score. They show that as the number of dummies is increased, the chances of database information leaking out tends towards zero.

Of course, one could argue that the limited usage of proprietary chemical information (in terms of people who have it and people who can make use of it) means that the efforts put in to obfuscation, cryptography etc. could simply be replaced by legal contracts. Certainly, a simple way to address the scenario discussed here (and noted by the authors) is to download the remote database locally. of course this is not feasible if the remote database is meant to stay private (e.g., a competitors structure database).

But nonetheless, methods that rigorously guarantee privacy of chemical information are interesting from an algorithmic standpoint. Even though Shimuzu et al described a very simplistic situation (though the more realistic scenario where the similar database molecules are returned would obviously negate constraint 2 above), it looks like a step forward in terms of applying formal cryptanalysis to chemical problems and supporting truly safe exchange of chemical information.

Written by Rajarshi Guha

January 5th, 2016 at 3:17 am

## A Model Building IDE?

Recently I came across a NIPS2015 paper from Vartak et al that describes a system (APIs + visual frontend) to support the iterative model building process. The problem they are addressing is common one in most machine learning settings – building multiple models (different type) using various features and identifying one or more optimal models to take into production. As they correctly point out, most tools such as scikit-learn, SparkML, etc. focus on providing methods and interfaces to build a single model – it’s up to the user to manage the multiple models, keep track of their performance metrics.

My first reaction was, “why?”. Part of this stems from my use of the R environment which allows me to build up a infrastructure for building multiple models (e.g., caret, e1701), storing them (list of model objects, RData binary files or even pmml) and subsequent comparison and summarization. Naturally, this is specific to me (so not really a general solution) and essentially a series of R commands – I don’t have the ability to monitor model building progress and simultaneously inspect models that have been built.

In that sense, the authors statement,

For the most part, exploration of models takes place sequentially. Data scientists must write (repeated) imperative code to explore models one after another.

is correct (though maybe, said data scientists should improve their programming skills?). So a system that allows one to organize an exploration of model space could be useful. However, other statements such as

• Without a history of previously trained models, each model must be trained from scratch, wasting computation and increasing response times.
• In the absence of history, comparison between models becomes challenging and requires the data scientist to write special-purpose code.

seem to be relevant only when you are not already working in an environment that supports model objects and their associated infrastructure. In my workflow, the list of model object represents my history!

Their proposed system called SHERLOCK is built on top of SparkML and provides an API to model building, a database for model storage and a graphical interface (see above) to all of this. While I’m not convinced of some of the design goals (e.g., training variations of models based on previously trained models (e.g., with different feature sets) – wouldn’t you need to retrain the model from scratch if you choose a new feature set?), it’ll be interesting to see how this develops. Certainly, the UI will be key – since it’s pretty easy to generate a multitude of models with a multitude of features, but in the end a human needs to make sense of it.

On a separate note, this sounds like something RStudio could take a stab at.

Written by Rajarshi Guha

December 7th, 2015 at 3:09 am

## Summarizing Collections of Curves

I was browsing live notes from the recent IEEE conference on visualization and came across a paper about functional boxplots. The idea is an extension of the boxplot visualization (shown alongside), to a set of functions. Intuitively, one can think of a functional box plot as specific envelopes for a set of functions. The construction of this plot is based on the notion of band depth (see the more general concept of data depth) which is a measure of how far a given function is from the collection of functions. As described in Sun & Genton the band depth for a given function can be computed by randomly selecting $J$ functions and identifying wether the given function is contained within the minimum and maximum of the $J$ functions. Repeating this multiple times, the fraction of times that the given function is fully contained within the $J$ random functions gives the band depth, $BD_j$. This is then used to order the functions, allowing one to compute a 50% band, analogous to the IQR in a traditional boxplot. There are more details (choice of $J$, partial bounding, etc.) described in the papers and links above.

My interest in this approach was piqued since one way of summarizing a dose response screen, or comparing dose response data across multiple conditions is to generate a box plot of a single curve fit parameter – say, $\log IC_{50}$. But if we wanted to consider the curves themselves, we have a few options. We could simply plot all of them, using translucency to avoid a blob. But this doesn’t scale visually. Another option, on the left, is to draw a series of box plots, one for each dose, and then optionally join the median of each boxplot giving a “median curve”. While these vary in their degree of utility, the idea of summarizing the distribution of a set of curves, and being able to compare these distributions is attractive. Functional box plots look like a way to do this. (A cool thing about functional boxplots is that they can be extended to multivariate functions such as surfaces and so on. See Mirzargar et al for examples)

Computing $BD_j$ can be time consuming if the number of curves is large or $J$ is large. Lopez-Pintado & Jornsten suggest a simple optimization to speed up this step, and for the special case of $J = 2$, Sun et al proposed a ranking based procedure that scales to thousands of curves. The latter is implemented in the fda package for R which also generates the final functional box plots.

As an example I considered 6 cell proliferation assays run in dose response, each one running the same set of compounds, but under different growth conditions. For each assay I only considered good quality curves (giving from 349 to 602 curves). The first plot compares the actives identified in the different growth conditions using the $\log IC_{50}$, and indicates a statistically significant increase in potency in the last three conditions compared to the first three.

In contrast, the functional box plots for the 6 assays, suggest a somewhat different picture (% Response = 100 corresponds to no cell kill and 0 corresponds to full cell kill).

The red dashed curves correspond to outliers and the blue lines correspond to the ‘maximum’ and ‘minimum’ curves (analogous to the whiskers of the traditional boxplot). Importantly, these are not measured curves, but instead correspond to the dose-wise maximum (and minimum) of the real curves. The pink region represents 50% of the curves and the black line represents the (virtual) median curve. In each case the X-axis corresponds to dose (unlabeled to save space). Personally, I think this visualization is a little cleaner than the dose-wise box plot shown above.

The mess of red lines in the plot 1 suggest an issue with the assay itself. While the other plots do show differences, it’s not clear what one can conclude from this. For example, in the plot for 4, the dip on the left hand side (i.e., low dose) could suggest that there is a degree of cytotoxicity, which is comparatively less in 3, 5 and 6. Interestingly none of the median curves are really sigmoidal, suggesting that the distribution of dose responses has substantial variance.

Written by Rajarshi Guha

November 30th, 2014 at 3:02 pm

## Thoughts on the DREAM Synergy Prediction Challenge

The DREAM consortium has run a number of predictive modeling challenges and the latest one on predicting small molecule synergies has just been published. The dataset that was provided included baseline gene expression of the cell line (OCI-LY3), expression in presence of compound (2 concentrations, 2 time points), dose response data for 14 compounds and the excess over Bliss for the 91 pairs formed from the 14 compounds. Based on this data (and available literature data) participants had to predict a ranking for the 91 combinations.

The paper reports the results of 31 approaches (plus one method that was not compared to the others) and does a good job of summarizing their performance and identifying whether certain data type or certain approaches work better than others. They also investigated the performance of an ensemble of approaches, which, as one might expect, worked better than the single methods. While the importance of gene expression in predictive performance was not as great as I would’ve thought, it was certainly more useful than chemical structure alone. Interestingly, they also noted that “compounds with more targeted mechanisms, such as rapamycin and blebbistatin, were least synergistic“. I suspect that this is somewhat dataset specific, but it will be interesting to see whether this holds in large collections of combination experiment such as those run at NCATS.

Overall, it’s an important contribution with the key take home message being

… synergy and antagonism are highly context specific and are thus not universal properties of the compounds’ chemical, structural or substrate information. As a result, predictive methods that account for the genetics and regulatory architecture of the context will become increasingly relevant to generalize results across multiple contexts

Given the relative dearth of predictive models of compound synergy, this paper is a nice compilation of methods. But there are some issues that weaken the paper.

• One key issue are the conclusions on model performance. The organizers defined a score, termed probabilistic c-score (PC score). If I understand correctly, a random ranking should give PC = 0.5. It turns out that the best performing method exhibited a PC score = 0.61 with a number of methods hovering around 0.5. Undoubtably, this is a tough problem, but when the authors states that “… this challenge shows that current methodologies can perform significantly better than chance …” I raise an eyebrow. I can only assume that what they meant was that the results were “statistically significantly better than chance“, because in terms of effect size the results are not impressive. After reading this excellent article on p-values and significance testing I’m particularly sensitized to claims of significance.
• The dataset could have been strengthened by the inclusion of self-crosses. This would’ve allowed the authors to assess actual excess over Bliss values corresponding to additivity (which will not be exactly 0 due to experimental noise), and avoid the use of cutoffs in determining what is synergistic or antagonistic.
• Similarly, a key piece of data that would really strengthen these approaches is the expression data in presence of combinations. While it’s unreasonable to have this data available for all combinations, it could be used as a first step in developing models to predict the expression profile in presence of combination treatment. Certainly, such data could be used to validate some assumptions made by some of the models described (e.g., concordance of DEG’s induced by single agents implies synergistic response).
• Kudos for including source code for the top methods, but would’ve been nicer if data files were included so we could actually reproduce the results.
• The authors conclude that when designing new synergy experiments, one should identify mechanistically diverse molecules to make up for the “small number of potentially synergistic pathways“. While mechanistic diversity is a good idea, it’s not clear how they conclude there are a small number of pathways that play a role in synergy.
• It’s a pity that the SynGen method was not compared to the other methods. While the authors provide a justification, it seems rather weak. The method only applied to the synergistic combinations (performance was not a whole lot better than random – true positive rate of 56%) – but the text indicates that it predicted synergistic compound pairs. It’s not clear whether this means it made a call on synergy or a predicted ranking. If the latter it would’ve been interesting to see how it compared to the rankings of the synergistic subset of 91 compounds from other methods.

Written by Rajarshi Guha

November 20th, 2014 at 5:37 pm