# So much to do, so little time

Trying to squeeze sense out of chemical data

## From Algorithmic Fairness to QSAR Models

The topic of algorithmic fairness has started receiving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.

I recently came across a paper from Adler et al on the topic of algorithmic fairness. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, all of these approaches involve retraining the model with scrambled or randomized descriptors.

The core of Adler et al is their statement that

> the information content of a feature can be estimated by trying to predict it from the remaining features.

Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.

The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.
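
As a rough illustration of how descriptor scrambling can be adapted to the black box setting, here is a minimal sketch that shuffles one descriptor at a time in the test set and records the drop in accuracy, without retraining. This is not the ε-obscurity procedure of Adler et al; `scrambled_accuracy_drop` and the toy classifier are hypothetical names, and the model is assumed to be any callable that returns a class label for a row of descriptors.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the black-box model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def scrambled_accuracy_drop(model, X, y, n_repeats=20, seed=0):
    """For each descriptor, shuffle its column in the test set and return the
    mean drop in accuracy relative to the unscrambled test set.
    The model itself is never retrained."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_scrambled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - accuracy(model, X_scrambled, y)
        drops.append(total / n_repeats)
    return drops

# Toy black-box classifier (hypothetical): it only looks at descriptor 0.
def toy_model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(42)
X_test = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_test = [int(row[0] > 0.5) for row in X_test]

drops = scrambled_accuracy_drop(toy_model, X_test, y_test)
# Descriptors ordered from most to least influential
ranking = sorted(range(len(drops)), key=lambda j: drops[j], reverse=True)
```

Since the toy model ignores descriptor 1, scrambling that column leaves the predictions untouched, whereas scrambling descriptor 0 degrades accuracy substantially.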

Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature $$X_i$$ and regress it on all the other features $$X_j\ (j \neq i)$$. Repeating this for all $$X_i$$ would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are successfully estimated from the remaining features. This would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).
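
A minimal sketch of this ranking, assuming NumPy is available. It uses ordinary least squares as a stand-in, so it only captures linear relationships; a non-linear learner would have to be swapped in to pick up non-linear correlations. The function name `feature_predictability` and the synthetic data are hypothetical, columns are standardized so that the RMSE values are comparable, and the RMSE is computed in-sample (a held-out estimate would be preferable in practice).

```python
import numpy as np

def feature_predictability(X):
    """Return, for each column of X, the in-sample RMSE obtained when that
    column is predicted from all remaining columns by least squares."""
    X = np.asarray(X, dtype=float)
    # Standardize so RMSE values are on the same scale for every feature
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    n, p = X.shape
    rmse = np.empty(p)
    for i in range(p):
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, X[:, i], rcond=None)
        residuals = X[:, i] - design @ coef
        rmse[i] = np.sqrt(np.mean(residuals ** 2))
    return rmse

# Synthetic descriptors: d is nearly a linear combination of a and b,
# while c is independent of everything else.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = rng.normal(size=500)
d = 2.0 * a - b + rng.normal(scale=0.01, size=500)
X = np.column_stack([a, b, c, d])

rmse = feature_predictability(X)
ranking = np.argsort(rmse)  # most predictable feature first
```

Here the independent column `c` ends up last in the ranking, with an RMSE close to 1 on the standardized scale.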

Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?

Written by Rajarshi Guha

August 8th, 2016 at 10:52 pm

## Database Licensing & Sustainability

Update (07/28/16): DrugBank/OMx have updated the licensing conditions for DrugBank data in response to concerns raised earlier by various people and groups. See here for a detailed response from Craig Knox.

A few days back I came across, via my Twitter network, the news that DrugBank had changed their licensing policy to CC BY-SA-NC 4.0. As such this is not a remarkable change (though one could argue about the NC clause, since as John Overington points out the distinction between commercial and non-commercial usage can be murky). However, on top of this license, the EULA listed a number of more restrictive conditions on reuse of the data. See this thread on ThinkLab for a more detailed discussion and breakdown.

This led to discussion amongst a variety of people regarding the sustainability of data resources. In this case while DrugBank was (and is) funded by federal grants, these are not guaranteed in perpetuity. And thus DrugBank, and indeed any resource, needs to have a plan to sustain itself. Charging for commercial access is one such approach. While it can be problematic for reuse and other Open projects, one cannot fault the developers if they choose a path that enables them to continue to build upon their work.

Interestingly, the Guide to Pharmacology resource posted a response to the DrugBank license change, in which they don’t comment on the DrugBank decision but do point out that

> The British Pharmacological Society (BPS) has committed support for GtoPdb until 2020 and the Wellcome Trust support for GtoImmuPdb until 2018. Needless to say the management team (between, IUPHAR, BPS and the University of Edinburgh) are engaged in sustainability planning beyond those dates. We have also just applied for UK ELIXIR Node consideration.

So it’s nice to see that the resource is completely free of any onerous restrictions until 2020. I have no doubt that the management team will be working hard to secure funding beyond that date. But in case they don’t, will their licensing also change to support some form of commercialization? Certainly, other resources are going down that path. John Overington pointed to BioCyc switching to a subscription model.

So the sustainability of data resources is an ongoing problem, and will become a bigger issue as the links between resources grow over time. Economic considerations would suggest that permanent funding of every database cannot happen.

So clearly, some resources will win and some will lose, and the winners will not stay winners forever.

### Open source software & transferring leadership

However, in contrast to databases, many Open Source software projects do continue development over pretty long time periods. Some of these projects receive public funding and also provide dual licensing options, allowing for income from industrial users.

However, there are others which are not heavily funded, yet continue to develop. My favorite example is Jmol which has been in existence for more than 15 years and has remained completely Open Source. One of the key features of this project is that the leadership has passed from one individual to another over the years, starting I think with Dan Gezelter, then Bradley Smith, Egon Willighagen, Miguel Rojas and currently Bob Hanson.

Comparing Open software to Open databases is not fully correct. But this notion of leadership transition is something that could play a useful role in sustaining databases. Thus, if group X cannot raise funding for continued development, maybe group Y (that obviously benefits from the database) that has funding, could take over development and maintenance.

There are obvious reasons that this won’t work – maybe the expertise resides only in group X? I doubt this is really an issue, at least for non-niche databases. One could also argue that this approach is a sort of proto-crowdsourcing approach. While crowdsourcing did come up in the Twitter thread, I’m not convinced this is a scalable approach to sustainability. The “diffuse motivation” of a crowd is quite distinct from the “focused motivation” of a dedicated group. And on top of that, many databases are specialized and the relevant crowd is rather small.

One ultimate solution is that governments host databases in perpetuity. This raises myriad issues. Does it imply storage and no development? Is this for all publicly funded databases? Or a subset? Who are the chosen ones? And of course, how long will the government pay for it? The NIH Commons, while not being designed for database persistence, is one such prototypical infrastructure that could start addressing these questions.

In conclusion, the issue of database sustainability is problematic and unsolved and the problem is only going to get worse. While unfortunate for Open science (and science in general) the commercialization of databases will always be a possibility. One hopes that in such cases, a balance will be struck between income and free (re)usage of these valuable resources.

Written by Rajarshi Guha

May 14th, 2016 at 7:26 pm

## Differential Dose Response – Some More Exploration

This is a follow-on to my previous post that described a recent paper where we explored a few ways to characterize the differential activity of small molecules in dose response screens. In this post I wanted to highlight some aspects of this type of analysis that didn’t make it into the final paper.

TL;DR there’s more to differential analysis of dose response data than thresholding and ranking.

### Comparing Model Fits

One approach to characterizing differential activity is to test whether the curve fit models (in our case 4-parameter Hill models) are indistinguishable or not. While traditionally, ANOVA could be used to test this, it assumes that the models being compared are nested. This is not the case when testing for effects of different treatments (i.e., same model, but different datasets). As a result we first considered the use of AIC – but even then, applying this to the same model built on different datasets is not really valid.

Another approach (described by Ritz et al) that we considered was to refit the curves for the two treatments simultaneously using replicates, and determine whether the ratio of the AC50’s (termed the Selectivity Index or SI) from the two models was different from 1.0. We can then test the hypothesis and determine whether the SI was statistically significant or not. The drawback is that it, ideally, requires that the curves differ only in potency. In practice this is rarely the case as effects such as toxicity might cause a shift in the response at low concentrations, partial efficacy might cause incomplete curves at high concentrations and so on.

We examined this approach by fitting curves such that the top and bottom of the curves were constrained to be identical in both treatments and only the Hill slope and AC50 were allowed to vary.

After appropriate correction, this identified molecules that exhibited p < 0.05 for the hypothesis that the SI was not 1.0. Independent and constrained curve fits for two compounds are shown alongside. While the constraint of equal top and bottom for both curves does lead to some differences compared to independent fits (especially from the point of view of efficacy), the current data suggest that the advantage of such a constraint (allowing robust inference on the statistical significance of SI) outweighs the disadvantages.
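
To make the constrained parameterization concrete, here is a minimal sketch of a four-parameter Hill model in which two treatments share the top and bottom asymptotes but keep their own AC50 and Hill slope. The exact functional form is an assumption (one common parameterization, with response rising from bottom to top as concentration increases), the function names are hypothetical, and which treatment goes in the numerator of the SI is a convention chosen here for illustration. No fitting or significance testing is shown.

```python
def hill(conc, bottom, top, ac50, slope):
    """Four-parameter Hill model: response at concentration `conc`.
    At conc == ac50 the response is midway between bottom and top."""
    return bottom + (top - bottom) / (1.0 + (ac50 / conc) ** slope)

def constrained_pair(conc, bottom, top, ac50_a, slope_a, ac50_b, slope_b):
    """Responses for two treatments that share top and bottom, while the
    AC50 and Hill slope are free to differ between them."""
    return (hill(conc, bottom, top, ac50_a, slope_a),
            hill(conc, bottom, top, ac50_b, slope_b))

def selectivity_index(ac50_a, ac50_b):
    """Ratio of the two AC50 values; 1.0 means no potency shift."""
    return ac50_a / ac50_b

# Hypothetical parameter values (concentrations in molar units)
resp_a, resp_b = constrained_pair(1e-6, bottom=0.0, top=100.0,
                                  ac50_a=1e-6, slope_a=1.2,
                                  ac50_b=1e-7, slope_b=0.9)
si = selectivity_index(1e-6, 1e-7)
```

With these values, treatment B is ten-fold more potent, so at 1 µM its response sits well above the half-maximal response seen for treatment A.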

### Variance Stabilization

Finally, given the rich literature on differential analysis for genomic data, our initial hope was to simply apply methods from that domain to the current problems. However, variance stabilization becomes an issue when dealing with small molecule data. It is well known from gene expression experiments that the variance in replicate measurements can be a function of the mean value of the replicates. If not taken into account, this can mislead a t-test into identifying a gene (or compound, in our case) as exhibiting non-differential behavior, when in fact it is differentially expressed (or active).

The figure below compares the standard deviation (SD) versus mean of each compound, for each parameter in the two treatments (HA22, an immunotoxin, and PBS, the vehicle treatment). Overlaid on the scatter plot is a loess fit. In the lower panel, we see that in the PBS treatment there is minimal dependency of SD on the mean values, except for the case of log AC50. However, for the case of HA22 treatment, each parameter shows a distinct dependence of SD on the mean replicate value.

Many approaches have been designed to address this issue in genomic data (e.g., Huber et al, Durbin et al, Papana & Ishwaran). One of the drawbacks of most approaches is that they assume a distributional model for the errors (which in the case of the small molecule data would correspond to the true parameter value minus the calculated value) or a specific model for the mean-variance relationship. However, to our knowledge, there is no general solution to the problem of choosing an appropriate error distribution for small molecule activity (or curve parameter) data. A non-parametric approach described by Motakis et al employs the observed replicate data to stabilize the variance, avoiding any distributional assumptions. However, one requirement is that the mean-variance relationship be monotonically increasing. From the figure above we see that this is somewhat true for efficacy but does not hold, in a global sense, for the other parameters.
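
A quick diagnostic for the mean-SD dependence is to compute the mean and SD of the replicates for each compound and summarize how monotonic the relationship is with a Spearman rank correlation. The sketch below is only a rough global check under that assumption, not the procedure of Motakis et al; the replicate values and function names are hypothetical.

```python
import statistics

def mean_and_sd(replicates):
    """Per-compound mean and sample SD of replicate measurements."""
    means = [statistics.mean(r) for r in replicates]
    sds = [statistics.stdev(r) for r in replicates]
    return means, sds

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical replicate values of one curve fit parameter for five compounds
replicates = [
    [1.0, 1.1, 0.9],
    [5.0, 5.5, 4.5],
    [10.0, 11.0, 9.0],
    [20.0, 22.0, 18.0],
    [40.0, 44.0, 36.0],
]
means, sds = mean_and_sd(replicates)
rho = spearman(means, sds)
```

A value of `rho` near 1 is consistent with an increasing mean-SD relationship, but it is a global summary and will not reveal local departures of the kind a loess fit can show.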

Overall, differential analysis of dose response data is somewhat of an open topic. While simple cases of pure potency or efficacy shifts can be easily analyzed, it can be challenging when all four curve fit parameters change. I’ve also highlighted some of the issues with applying methods devised for genomic data to small molecule data – solutions to these would enable the reuse of some powerful machinery.

Written by Rajarshi Guha

May 7th, 2016 at 5:03 pm

## Analysing Differential Activity in Dose Response Screens

My colleagues and I recently published a paper where we explored a few methods to identify differential behavior in dose response screens. While there is an extensive literature about analyzing differential effects in genomic data (e.g. microarrays, RNAseq), these methods are based on distributional assumptions that hold for genomic data. This is not necessarily the case for small molecule, dose response data. A separate post will explore this aspect.

So we couldn’t directly apply the methods devised for genomic data. Another issue that we wanted to address was the lack of replicates. As a result certain methods are excluded from consideration (e.g., t-test based methods). The simplest case (or what we refer to as obviously differential) is when a compound is active in one treatment and completely inactive in the other. This is trivial to characterize. The next method we considered was to look at fold changes for individual curve fit parameters and then choose an arbitrary threshold. This is not a particularly robust approach, and has no real statistical basis. However, such thresholding is still used in a number of scenarios (e.g., cherry picking in single point screens). In addition, in this approach you have to choose one of many parameters. So finally, we considered a data fusion approach that ranked compounds using the rank product method. This method employed potency, response at the highest concentration and the AUC. The nice thing about this method is that it doesn’t require choosing a threshold, provides an empirical p-value and is flexible enough to include other relevant parameters (say, physicochemical properties).
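
The rank product idea can be sketched in a few lines. The example below ranks compounds by the between-treatment difference in each of three parameters, multiplies the ranks, and derives an empirical p-value by permuting the ranks within each parameter. It is a minimal illustration rather than the code used in the paper: the difference values are made up, the function names are hypothetical, and pooling the permuted rank products across compounds is just one simple way to build the null distribution.

```python
import random

def ranks_desc(values):
    """Rank 1 goes to the largest value; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

def rank_products(rank_lists):
    """Product of each compound's ranks across parameters (smaller means
    more consistently differential). The geometric mean is a monotone
    transform of this product, so the ordering is the same."""
    products = [1] * len(rank_lists[0])
    for ranks in rank_lists:
        for i, r in enumerate(ranks):
            products[i] *= r
    return products

def empirical_pvalues(rank_lists, n_perm=2000, seed=0):
    """Fraction of permuted rank products that are <= the observed one."""
    rng = random.Random(seed)
    observed = rank_products(rank_lists)
    n = len(observed)
    counts = [0] * n
    for _ in range(n_perm):
        permuted = [rng.sample(ranks, n) for ranks in rank_lists]
        null = rank_products(permuted)
        for i, obs in enumerate(observed):
            counts[i] += sum(1 for value in null if value <= obs)
    return [count / (n_perm * n) for count in counts]

# Hypothetical absolute differences between two treatments for six compounds
delta_log_ac50 = [2.1, 0.3, 0.1, 0.8, 0.2, 0.5]
delta_max_resp = [60.0, 5.0, 12.0, 30.0, 2.0, 8.0]
delta_auc = [150.0, 20.0, 10.0, 90.0, 5.0, 30.0]

rank_lists = [ranks_desc(v) for v in (delta_log_ac50, delta_max_resp, delta_auc)]
products = rank_products(rank_lists)
pvalues = empirical_pvalues(rank_lists)
order = sorted(range(len(products)), key=lambda i: products[i])
```

Adding another parameter, such as a physicochemical property, only requires appending one more rank list.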

Finally, we examined how single point data (modeled using the response at the highest concentration) compared to dose response data at identifying differential actives. As one might expect, the obviously differential compounds were easily identified. However for compounds active in both treatments, the single point approach led to more false positives. Thus, even though dose response is more resource-intensive, the improved accuracy makes it worth it.

In the next post I’ll look at some of the issues that didn’t make it into this paper – in particular hypothesis-based tests that focus on testing differences between model fits. One key observation (also suggested by Gelman) is that strict p-value cutoffs lead one to focus on obvious or well-known effects. For small-scale exploratory analyses such as described in this paper, a more relaxed threshold of 0.1 might be more suitable, allowing marginal effects that may, however, be biologically interesting to be considered.

Written by Rajarshi Guha

May 2nd, 2016 at 2:10 am

## vSDC, Rank Products and DUD-E

This post is a follow-up to my previous discussion on a paper by Chaput et al. The gist of that paper was that in a virtual screening scenario where a small number of hits are to be selected for followup, one could use an ensemble of docking methods, identify compounds whose scores were beyond 2SD of the mean for each method and take the intersection. My post suggested that a non-parametric approach (rank products, RP) performed similarly to the parametric approach of Chaput et al on the two targets they screened.
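
The 2SD-and-intersect idea can be sketched as follows. This is a minimal illustration of the selection rule as summarized above, not the vSDC implementation of Chaput et al: the scores are invented, the function names are hypothetical, and the example assumes higher scores are better for every method (real docking programs differ in sign convention, hence the `higher_is_better` flag).

```python
import statistics

def beyond_2sd(scores, higher_is_better=True):
    """Compounds whose score is more than 2 SD better than the mean score
    for one docking method. `scores` maps compound id -> score."""
    values = list(scores.values())
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    if higher_is_better:
        return {c for c, s in scores.items() if s > mu + 2 * sd}
    return {c for c, s in scores.items() if s < mu - 2 * sd}

def consensus_hits(score_tables, higher_is_better=True):
    """Intersection of the per-method selections."""
    selections = [beyond_2sd(t, higher_is_better) for t in score_tables]
    return set.intersection(*selections)

# Hypothetical scores for 20 compounds from two docking methods
method_a = {f"c{i}": 5.0 + 0.1 * (i % 5) for i in range(20)}
method_a["c3"] = 9.0
method_a["c7"] = 8.8

method_b = {f"c{i}": 5.0 + 0.1 * (i % 5) for i in range(20)}
method_b["c3"] = 9.5
method_b["c11"] = 9.0

hits = consensus_hits([method_a, method_b])
```

Only the compound flagged by both methods survives the intersection, which is why the selected set shrinks quickly as more methods are added.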

The authors also performed a benchmark comparison of their consensus method (vSDC) versus the individual docking methods for 102 DUD-E targets. I was able to obtain the individual docking scores (Glide, Surflex, FlexX and GOLD) for each of the targets, with the aim of applying the rank product method described previously.

In short, I reproduced Figure 6A (excluding the curve for vSDC). In this figure, $$n_{test}$$ is the number of compounds selected (from the ranked list, either by individual docking scores or by the rank product) and $$T_{h>0}$$ is the percentage of targets for which the $$n_{test}$$ selected compounds included one or more actives. Code is available here, but you’ll need to get in touch with the authors for the DUD-E docking scores.
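
For reference, $$T_{h>0}$$ as defined above can be computed with a short function. This is a minimal sketch with made-up targets and compound ids, not the code linked above; `pct_targets_with_hit` is a hypothetical name, and each target is assumed to come with a list of compounds already ranked best-first.

```python
def pct_targets_with_hit(ranked_by_target, actives_by_target, n_test):
    """Percentage of targets for which the top n_test ranked compounds
    contain at least one known active."""
    hits = 0
    for target, ranked in ranked_by_target.items():
        selected = set(ranked[:n_test])
        if selected & actives_by_target[target]:
            hits += 1
    return 100.0 * hits / len(ranked_by_target)

# Hypothetical ranked lists (best first) and actives for four targets
ranked_by_target = {
    "t1": ["a", "b", "c", "d"],
    "t2": ["e", "f", "g", "h"],
    "t3": ["i", "j", "k", "l"],
    "t4": ["m", "n", "o", "p"],
}
actives_by_target = {
    "t1": {"a"},
    "t2": {"g"},
    "t3": {"l"},
    "t4": {"n", "p"},
}

curve = {n: pct_targets_with_hit(ranked_by_target, actives_by_target, n)
         for n in (1, 2, 3, 4)}
```

Evaluating this over a range of $$n_{test}$$ values, once per ranking scheme, gives the curves that are compared in the figure.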

As shown alongside, the RP method (as expected) outperforms the individual docking methods. And visual comparison with the original figure suggests that it also outperforms vSDC, especially at lower values of $$n_{test}$$. While I wouldn’t regard the better performance of RP compared to vSDC as a huge jump, the absence of a threshold certainly works in its favor.

One could certainly explore ranking approaches in more depth. As suggested by Abhik Seal, Borda or Condorcet methods could be examined (though the small number of docking methods, a.k.a. voters, could be problematic).

UPDATE: After a clarification from Liliane Mouawad it turns out there was a mistake in the ranking of the Surflex docking scores. Correcting that bug fixes my reproduction of Figure 6A so that the curves for individual docking methods match the original. But more interestingly, the performance of RP is now clearly better than every individual method and the vSDC method as well, at all values of $$n_{test}$$.

Written by Rajarshi Guha

February 13th, 2016 at 7:25 pm