# So much to do, so little time

Trying to squeeze sense out of chemical data

## From Algorithmic Fairness to QSAR Models

The topic of algorithmic fairness has started receiving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods, or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.

I recently came across a paper from Adler et al on this topic. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, all of these approaches involve retraining the model with scrambled or randomized descriptors.

The core of Adler et al is their statement that

the information content of a feature can be estimated by trying to predict it from the remaining features.

Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.

The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.
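As a rough illustration of the simpler baseline mentioned above, here is a minimal sketch of descriptor scrambling under the black box assumption: the model is only called, never refitted. This is not the ε-obscurity procedure from Adler et al. The function names, the toy data and the threshold "model" are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the black-box prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def scrambling_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each descriptor column is shuffled in turn.

    `predict` is treated as a black box: it is only called, never refitted.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the test set with only column j scrambled
            X_scrambled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - accuracy(predict, X_scrambled, y)
        drops.append(total / n_repeats)
    return drops

# Toy data (hypothetical): the "model" only ever looks at descriptor 0.
X = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
y = [0, 1, 0, 1, 0, 1]
model = lambda row: int(row[0] > 0.5)

drops = scrambling_importance(model, X, y)
```

Descriptors can then be ranked by their entry in `drops`; a descriptor the model ignores (descriptor 1 here) shows no drop at all.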

Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature $$X_i$$ and regress it on all the other features $$X_j\ (j \neq i)$$. Repeating this for all $$X_i$$ would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are successfully estimated from the remaining features. With a suitably flexible regression method this would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but would not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).
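A minimal sketch of this idea, assuming ordinary linear least squares (so it only captures linear relationships) and in-sample RMSE on unscaled features. The helper names and the synthetic data are hypothetical; a real analysis would more likely use a numerical library and a held-out set.

```python
import math
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rmse_from_others(X, i):
    """In-sample RMSE of a linear fit of column i on the remaining columns."""
    y = [row[i] for row in X]
    A = [[1.0] + row[:i] + row[i + 1:] for row in X]  # intercept + other features
    k = len(A[0])
    # Normal equations: (A^T A) coef = A^T y
    AtA = [[sum(a[r] * a[c] for a in A) for c in range(k)] for r in range(k)]
    Aty = [sum(a[r] * y_n for a, y_n in zip(A, y)) for r in range(k)]
    coef = solve(AtA, Aty)
    residuals = [y_n - sum(w * v for w, v in zip(coef, a)) for a, y_n in zip(A, y)]
    return math.sqrt(sum(e * e for e in residuals) / len(y))

# Synthetic data: feature 2 is nearly a linear combination of features 0 and 1,
# feature 3 is independent noise.
rng = random.Random(0)
X = []
for _ in range(200):
    a, b, c = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    X.append([a, b, 2.0 * a - b + rng.gauss(0, 0.05), c])

rmse = [rmse_from_others(X, i) for i in range(4)]
ranking = sorted(range(4), key=lambda i: rmse[i])  # most predictable first
```

Note that features 0 and 1 also come out as highly predictable here, since each can be recovered from the other two correlated columns; the independent feature 3 ends up last in `ranking`.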

Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?

Written by Rajarshi Guha

August 8th, 2016 at 10:52 pm

## vSDC, Rank Products and DUD-E

This post is a follow-up to my previous discussion on a paper by Chaput et al. The gist of that paper was that in a virtual screening scenario where a small number of hits are to be selected for followup, one could use an ensemble of docking methods, identify compounds whose scores were beyond 2SD of the mean for each method and take the intersection. My post suggested that a non-parametric approach (rank products, RP) performed similarly to the parametric approach of Chaput et al on the two targets they screened.

The authors also performed a benchmark comparison of their consensus method (vSDC) versus the individual docking methods for 102 DUD-E targets. I was able to obtain the individual docking scores (Glide, Surflex, FlexX and GOLD) for each of the targets, with the aim of applying the rank product method described previously.

In short, I reproduced Figure 6A (excluding the curve for vSDC). In this figure, $$n_{test}$$ is the number of compounds selected (from the ranked list, either by individual docking scores or by the rank product) and $$T_{h>0}$$ is the percentage of targets for which the $$n_{test}$$ selected compounds included one or more actives. Code is available here, but you’ll need to get in touch with the authors for the DUD-E docking scores.
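To make the definition of $$T_{h>0}$$ concrete, here is a small sketch that computes it from per-target ranked lists. The target names and activity labels are made up for illustration and are not DUD-E data.

```python
def pct_targets_with_hit(ranked_labels_by_target, n_test):
    """Percentage of targets whose top n_test compounds contain at least one active."""
    hits = sum(any(labels[:n_test]) for labels in ranked_labels_by_target.values())
    return 100.0 * hits / len(ranked_labels_by_target)

# Hypothetical data: True marks an active, lists are already sorted best-first.
ranked = {
    "target_A": [False, True, False, False],
    "target_B": [False, False, False, True],
    "target_C": [True, False, False, False],
    "target_D": [False, False, False, False],
}

curve = {n: pct_targets_with_hit(ranked, n) for n in (1, 2, 4)}
```

Evaluating this over a range of $$n_{test}$$ values, once per ranking method, gives one curve per method.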

As shown alongside, the RP method (as expected) outperforms the individual docking methods. And visual comparison with the original figure suggests that it also outperforms vSDC, especially at lower values of $$n_{test}$$. While I wouldn’t regard the better performance of RP compared to vSDC as a huge jump, the absence of a threshold certainly works in its favor.

One could certainly explore ranking approaches in more depth. As suggested by Abhik Seal, Borda or Condorcet methods could be examined (though the small number of docking methods, a.k.a. voters, could be problematic).
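For reference, a minimal sketch of a Borda count where each docking program acts as a voter. The compound IDs and rankings are hypothetical, and this is the plain textbook Borda rule rather than anything evaluated in the post.

```python
def borda_scores(rankings):
    """Sum of Borda points; each voter is a list of items ordered best-first."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # Best item gets n-1 points, worst gets 0
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    return scores

# Hypothetical rankings of four compounds by four docking programs
voters = {
    "Glide": ["c1", "c2", "c3", "c4"],
    "Surflex": ["c2", "c1", "c4", "c3"],
    "FlexX": ["c2", "c3", "c1", "c4"],
    "GOLD": ["c1", "c2", "c4", "c3"],
}

scores = borda_scores(voters.values())
consensus = sorted(scores, key=scores.get, reverse=True)
```

With only four voters, a single program changing its mind can easily reorder the consensus, which is the concern raised above.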

UPDATE: After a clarification from Liliane Mouawad it turns out there was a mistake in the ranking of the Surflex docking scores. Correcting that bug fixes my reproduction of Figure 6A so that the curves for individual docking methods match the original. But more interestingly, the performance of RP is now clearly better than every individual method and the vSDC method as well, at all values of $$n_{test}$$.

Written by Rajarshi Guha

February 13th, 2016 at 7:25 pm

## Hit Selection When You’re Strapped for Cash


I came across a paper from Chaput et al that describes an approach to hit selection from a virtual screen (using docking), when follow-up resources are limited (a common scenario in many academic labs). Their approach is based on using multiple docking programs. As they (and others) have pointed out, there is a wide divergence between the rankings of compounds generated using different programs. Hence the motivation for a consensus approach, based on estimating the standard deviation (SD) of the scores generated by a given program and computing the intersection of compounds whose scores are more than 2 standard deviations from the mean in each program. Based on this rule, they selected relatively few compounds – just 14 to 22, depending on the target – and confirmed at least one of them for each target. This represents less than 0.5% of their screening deck.
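The selection rule can be sketched roughly as follows. This is my reading of the rule as described above, not the authors' code; it assumes lower (more negative) scores are better in every program, and the program names, compound IDs and scores are invented.

```python
from statistics import mean, stdev

def beyond_threshold(scores, n_sd=2.0):
    """IDs whose score is more than n_sd standard deviations better than the mean.

    Assumes lower (more negative) scores are better.
    """
    values = list(scores.values())
    cutoff = mean(values) - n_sd * stdev(values)
    return {mol for mol, s in scores.items() if s < cutoff}

def consensus_hits(scores_by_program, n_sd=2.0):
    """Intersection of the per-program selections."""
    selected = [beyond_threshold(s, n_sd) for s in scores_by_program.values()]
    return set.intersection(*selected)

def toy_scores(outliers):
    """Twenty hypothetical compounds with unremarkable scores, plus a few outliers."""
    scores = {f"mol{i}": -5.0 - 0.1 * (i % 4) for i in range(20)}
    scores.update(outliers)
    return scores

scores_by_program = {
    "prog_A": toy_scores({"mol3": -12.0, "mol7": -11.0}),
    "prog_B": toy_scores({"mol3": -13.0, "mol9": -11.5}),
}

hits = consensus_hits(scores_by_program)
```

Only `mol3` stands out in both programs, so it is the only compound that survives the intersection.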

However, their method is parametric – you need to select a SD threshold. I was interested in seeing whether a non-parametric, ranking based approach would allow one to retrieve a subset that included the actives identified by the authors. The method is essentially the rank product method applied to the docking scores. That is, the compounds are ranked based on their docking scores and the “ensemble rank” for a compound is the product of its ranks according to each of the four programs. In contrast to the original definition, I used a sum log rank to avoid overflow issues. So the ensemble rank for the $$i$$’th compound is given by

$$R_i = \sum_{j=1}^{4} \log r_{ij}$$

where $$r_{ij}$$ is the rank of the $$i$$’th compound in the $$j$$’th docking program. Compounds are then selected based on their ensemble rank. Obviously this doesn’t give you a selection per se. Instead, this allows you to select as many compounds as you want or need. Importantly, it allows you to introduce external factors (cost, synthetic feasibility, ADME properties, etc.) as additional rankings that can be included in the ensemble rank.
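To illustrate the last point, here is a small sketch of the sum log rank with an external factor added as one more ranking. The scores and costs are invented, lower values are assumed to be better in every column, and ties are broken by input order for simplicity (R's `rank` averages ties by default).

```python
import math

def ranks(values, higher_is_better=False):
    """1-based ranks (1 = best); ties are broken by input order for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def sum_log_rank(rank_columns):
    """R_i = sum over rankings j of log(r_ij), one value per compound."""
    return [sum(math.log(r) for r in per_compound) for per_compound in zip(*rank_columns)]

# Hypothetical data for four compounds
dock_a = [-9.1, -7.4, -8.2, -6.0]      # docking program A, lower is better
dock_b = [-38.0, -41.0, -35.0, -20.0]  # docking program B, lower is better
cost = [500.0, 20.0, 35.0, 10.0]       # cost per compound, lower is better

docking_only = sum_log_rank([ranks(dock_a), ranks(dock_b)])
with_cost = sum_log_rank([ranks(dock_a), ranks(dock_b), ranks(cost)])

order_docking = sorted(range(4), key=lambda i: docking_only[i])
order_with_cost = sorted(range(4), key=lambda i: with_cost[i])
```

In this toy case the best docked compound (index 0) is also the most expensive, so adding cost as a third ranking moves the second compound to the top.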

Using the docking scores for Calcineurin and Histone Binding Protein (Hbp) provided by Liliane Mouawad (though all the data really should’ve been included in the paper) I applied this method using the code below

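```r
library(stringr)

# Docking scores: one row per compound, one column per docking program
d <- read.table('http://cmib.curie.fr/sites/u759/files/document/score_vs_cn.txt',
                header=TRUE, comment='')
names(d) <- c('molid', 'Surflex', 'Glide', 'Flexx', 'GOLD')
d$GOLD <- -1*d$GOLD ## GOLD: higher scores are better, so negate

# Rank within each program, then sum the log ranks (log of the rank product)
ranks <- apply(d[,-1], 2, rank)
lranks <- rowSums(log(ranks))
tmp <- data.frame(molid=d[,1], ranks, lrp=lranks)
tmp <- tmp[order(tmp$lrp),]

# Positions of the known actives in the ensemble ranking
which(str_detect(tmp$molid, 'ACTIVE'))
```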

and identified the single active for Hbp at ensemble rank 8 and the three actives for Calcineurin at ranks 3, 5 and 25. Of course, if you were selecting only the top 3 you would’ve missed the Hbp hit and gotten only 1/3 of the Calcineurin hits. However, as the authors nicely showed, manual inspection of the binding poses is crucial to making an informed selection. The ranking is just a starting point.

Update: Docking scores for Calcineurin and Hbp are now available

Written by Rajarshi Guha

February 5th, 2016 at 1:36 am

## Cryptography & Chemical Structure Search

Encryption of chemical information has not been a very common topic in cheminformatics. There was an ACS symposium in 2005 (summary) that had a number of presentations on the topic of “safe exchange” of chemical information – i.e., exchanging information on chemical structures without sharing the structures themselves. The common thread running through many presentations was to identify representations (a.k.a. descriptors) that can be used for useful computation (e.g., regression or classification models or similarity searches) but do not allow one to (easily) regenerate the structure. Examples include the use of PASS descriptors and various topological indices. Non-descriptor based approaches included surrogate data (that is, structures of related molecules with similar properties) and, most recently, scaffold networks. Also, Masek et al, JCIM, 2008 described a procedure to assess the risk of revealing structure information given a set of descriptors.

As indicated by Tetko et al, descriptor based approaches are liable to dictionary based attacks. Theoretically, if one fully enumerates all possible molecules and computes the descriptors, it would be trivial to obtain the structure of an obfuscated molecule. While this is not currently practical, Masek et al have already shown that an evolutionary algorithm can reconstruct the exact (or closely related) structure from BCUT descriptors in a reasonable time frame and Wong & Burkowski, JCheminf, 2009 described a kernel approach to generating structures from a set of descriptors (though they were considering the inverse QSAR problem rather than chemical privacy). Until now I wasn’t aware of approaches that were truly one way – impossible to regenerate the structure from the descriptors, yet still able to support useful computations.

Which brings me to an interesting paper by Shimizu et al which describes a cryptographic approach to chemical structure search, based on homomorphic encryption. A homomorphic encryption scheme allows one to perform computations on the encrypted (usually based on PKI) input, leading to an encrypted result which, when decrypted, gives the same result as if one had performed the computation on the clear (i.e., unencrypted) input. Now, a “computation” can involve a variety of operations – addition, multiplication etc. Till recently, most homomorphic schemes were restricted to one or a few operations (and so are termed partially homomorphic). It was only in 2009 that a practical proposal for a fully homomorphic (i.e., supporting arbitrary computations) cryptosystem was described. See this excellent blog post for more details on homomorphic cryptosystems.
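As a concrete example of a partially homomorphic scheme, here is a toy version of the Paillier cryptosystem, which supports addition on encrypted values. I am using it purely to illustrate the property; it is not necessarily the scheme used in the paper, and the tiny primes make it completely insecure.

```python
import math
import random

def keygen(p=1789, q=1867):
    """Toy key pair from two small primes; real keys use far larger primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng=random):
    """c = g^m * r^n mod n^2 with g = n + 1 and a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
total = decrypt(n, private_key, c_sum)
```

Whoever computes `add_encrypted` sees only ciphertexts and the public modulus `n`; only the holder of the private key can read `total`. The modular inverse via `pow(lam, -1, n)` needs Python 3.8 or later.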

The work by Shimizu et al addresses the specific case of a user trying to identify molecules from a database that are similar to a query structure. They consider a simplified situation where the user is only interested in the count of molecules above a similarity threshold. Two constraints are:

1. Ensure that the database does not know the actual query structure
2. The user should not gain information about the database contents (except for number of similar molecules)

Their scheme is based on an additive homomorphic system (i.e., the only operation supported on the encrypted data is addition) and employs binary fingerprints and the Tversky similarity metric (which can be reduced to Tanimoto if required). I note that they used 166-bit MACCS keys. Since the fingerprint is small and each bit position is known, it seems that some information could leak out of the encrypted fingerprint or be subject to a dictionary attack. I’d have expected that using a larger hashed fingerprint would have helped improve the security. (Though I suspect that the encryption of the query fingerprint alleviates this issue). Another interesting feature, designed to prevent information about the database leaking back to the user, is the use of “dummies” – random, encrypted (by the user’s public key) integers that are mixed with the true (encrypted) query result. Their design allows the user to determine the sign of the query result (which indicates whether the database molecule is similar to the query, above the specified threshold), but does not let them get the actual similarity score. They show that as the number of dummies is increased, the chances of database information leaking out tends towards zero.
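To see why a sign can be enough, here is a plain (unencrypted) sketch of the Tversky index on binary fingerprints, together with one way of rewriting the threshold test as the sign of a linear combination of bit counts. This rearrangement is my own illustration and not necessarily the exact expression used in the paper; the fingerprints are hypothetical sets of on-bit positions.

```python
def tversky(query, target, alpha=1.0, beta=1.0):
    """Tversky index on sets of on-bits; alpha = beta = 1 gives Tanimoto."""
    common = len(query & target)
    only_query = len(query - target)
    only_target = len(target - query)
    denom = common + alpha * only_query + beta * only_target
    return common / denom if denom else 0.0

def threshold_score(query, target, theta, alpha=1.0, beta=1.0):
    """Non-negative exactly when tversky(query, target) >= theta (non-empty inputs).

    Follows from multiplying S >= theta through by the (positive) denominator.
    """
    common = len(query & target)
    penalty = alpha * len(query - target) + beta * len(target - query)
    return (1.0 - theta) * common - theta * penalty

# Hypothetical fingerprints, given as sets of on-bit positions
query = {1, 5, 9, 12}
target = {1, 5, 9, 30, 41}

similarity = tversky(query, target)              # 3 / (3 + 1 + 2)
passes_04 = threshold_score(query, target, 0.4)  # positive: similar at 0.4
passes_07 = threshold_score(query, target, 0.7)  # negative: not similar at 0.7
```

Because the test is linear in the bit counts, it is the kind of quantity that can in principle be accumulated using additions alone.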

Of course, one could argue that the limited usage of proprietary chemical information (in terms of people who have it and people who can make use of it) means that the efforts put into obfuscation, cryptography etc. could simply be replaced by legal contracts. Certainly, a simple way to address the scenario discussed here (and noted by the authors) is to download the remote database locally. Of course, this is not feasible if the remote database is meant to stay private (e.g., a competitor’s structure database).

But nonetheless, methods that rigorously guarantee privacy of chemical information are interesting from an algorithmic standpoint. Even though Shimizu et al described a very simplistic situation (though the more realistic scenario where the similar database molecules are returned would obviously negate constraint 2 above), it looks like a step forward in terms of applying formal cryptanalysis to chemical problems and supporting truly safe exchange of chemical information.

Written by Rajarshi Guha

January 5th, 2016 at 3:17 am

## Maximally Bridging Rings (or, Doing What the Authors Should’ve Done)


Recently I came across a paper from Marth et al that described a method based on network analysis to support retrosynthetic planning, particularly for complex natural products. I’m no synthetic chemist so I can’t comment on the relevance or importance of the targets or the significance of the proposed approach to planning a synthetic route. What caught my eye was the claim that

This work validates the utility of network analysis as a starting point for identifying strategies for the syntheses of architecturally complex secondary metabolites.

I was a little disappointed (hey, a Nature publication sets certain expectations!) that the network analysis was fundamentally walking the molecular graph to identify a certain type of ring, termed the maximally bridging ring. The algorithm is described in the SI and the authors make it available as an online tool. Unfortunately they didn’t provide any source code for their algorithm, which was a bit irritating, given that the algorithm is a key component of the paper.

I put together an implementation using the CDK (1.5.12), available in a GitHub repo. It’s a quick hack, using the parameters specified in the paper, and hasn’t been extensively tested. However, it seems to give the correct result for the first few test cases in the SI.

The tool will print out the hash code of the rings recognized as maximally bridging and also generate an SVG depiction with the first such ring highlighted in red, such as shown alongside. You can build a self-contained version of the tool as

```
git clone git@github.com:rajarshi/maxbridgerings.git
cd maxbridgerings
mvn clean package
```

The tool can then be run as follows (with the depiction output to Copaene.svg):

```
java -jar target/MaximallyBridgingRings-1.0-jar-with-dependencies.jar \
  "CC(C)C1CCC2(C3C1C2CC=C3C)C" Copaene
```

Written by Rajarshi Guha

December 24th, 2015 at 4:10 am