# So much to do, so little time

Trying to squeeze sense out of chemical data

## Differential Dose Response – Some More Exploration

This is a follow on to my previous post that described a recent paper where we explored a few ways to characterize the differential activity of small molecules in dose response screens. In this post I wanted to highlight some aspects of this type of analysis that didn’t make it into the final paper.

TL;DR there’s more to differential analysis of dose response data than thresholding and ranking.

### Comparing Model Fits

One approach to characterizing differential activity is to test whether the curve fit models (in our case 4-parameter Hill models) are indistinguishable or not. While traditionally, ANOVA could be used to test this, it assumes that the models being compared are nested. This is not the case when testing for effects of different treatments (i.e., same model, but different datasets). As a result we first considered the use of AIC – but even then, applying this to the same model built on different datasets is not really valid.

Another approach (described by Ritz et al) that we considered was to refit the curves for the two treatments simultaneously using replicates, and determines whether the ratio of the AC50’s (termed the Selectivity Index or SI) from the two models was different from 1.0. We can then test the hypothesis and determine whether the SI was statistically significant or not. The drawback is that it, ideally, requires that the curves differ only in potency. In practice this is rarely the case as effects such as toxicity might cause a shift the in the response at low concentrations, partial efficacy might cause incomplete curves at high concentrations and so on.

We examined this approach by fitting curves such that the top and bottom of the curves were constrained to be identical in both treatments and only the Hill slope and AC50 were allowed to vary.

After, appropriate correction, this identified molecules that exhibited p < 0.05 for the hypothesis that the SI was not 1.0. Independent and constrained curve fits for two compounds are shown alongside. While the constraint of equal top and bottom for both curves does lead to some differences compared to independent fits (especially from the point of view of efficacy), the current data suggests that the advantage of such a constraint (allowing robust inference on the statistical significance of SI) outweighs the disadvantages.

### Variance Stabilization

Finally, given the rich literature on differential analysis for genomic data, our initial hope was to simply apply methods from that domain to the current problems. However, variance stabilization becomes an issue when dealing with small molecule data. It is well known from gene expression experiments that that the variance in replicate measurements can be a function of the mean value of the replicates. If not taken into account, this can mislead a t-test into identifying a gene (or compound, in our case) as exhibiting non-differential behavior, when in fact it is differentially expressed (or active).

The figure below compares the standard deviation (SD) versus mean of each compound, for each parameter in the two treatments (HA22, an immunotoxin and PBS, the vehicle treatment). Overlaid on the scatter plot is a loess fit. In the lower panel, we see that in the PBS treatment there is minimal dependency of SD on the mean values, except for the case of log AC50. However, for the case of HA22 treatment, each parameter shows a distinct dependence of SD on the mean replicate value.

Many approaches have been designed to address this issue in genomic data (e.g., Huber et al, Durbin et al, Papana & Ishwaran). One of the drawbacks of most approaches is that they assume a distributional model for the errors (which in the case of the small molecule data would correspond to the true parameter value minus the calculated value) or a specific model for the mean-variance relationship. However, to our knowledge, there is no general solution to the problem of choosing an appropriate error distribution for small molecule activity (or curve parameter) data. A non-parametric approach described by Motakis et al employs the observed replicate data to stabilize the variance, avoiding any distributional assumptions. However, one requirement is that the mean-variance relationship be monotonic increasing. From the figure above we see that this is somewhat true for efficacy but does not hold, in a global sense, for the other parameters.

Overall, differential analysis of dose response data is somewhat of an open topic. While simple cases of pure potency or efficacy shifts can be easily analyzed, it can be challenging when all four curve fit parameters change. I’ve also highlighted some of the issues with applying methods devised for genomic data to small molecule data – solutions to these would enable the reuse of some powerful machinery.

Written by Rajarshi Guha

May 7th, 2016 at 5:03 pm

## Analysing Differential Activity in Dose Response Screens

with one comment

My colleagues and I recently published a paper where we explored a few methods to identify differential behavior in dose response screens. While there is an extensive literature about analyzing differential effects in genomic data (e.g. mciroarrays, RNAseq), these methods are based on distributional assumptions that holds for genomic data. This is not necessarily the case for small molecule, dose response data. A separate post will explore this aspect.

So we couldn’t directly apply the methods devised for genomic data. Another issue that we wanted to address was the lack of replicates. As a result certain methods are excluded from consideration (e.g., t-test based methods). The simplest case (or what we refer to as obviously differential) is when a compound is active in one treatment and completely inactive in the other. This is trivial to characterize. The next method we considered was to look at fold changes for individual curve fit parameters and then choose an arbitrary threshold. This is not a particularly robust approach, and has no real statistical basis. However, such thresholding is still used in a number of scenarios (e.g., cherry picking in single point screens). In addition, in this approach you have to choose one of many parameters. So finally, we considered a data fusion approach, that ranked compounds using the rank product method. This method employed potency, response at the highest concentration and the AUC. The nice thing about this method is that it doesn’t require choosing a threshold, provides an empirical p-value and is flexible enough to include other relevant parameters (say, physicochemical properties).

Finally, we examined how single point data (modeled using the response at the highest concentration) compared to dose response data at identifying differential actives. As one might expect, the obviously differential compounds were easily identified. However for compounds active in both treatments, the single point approach led to more false positives. Thus, even those dose response is more resource-intensive, the improved accuracy makes it worth it.

In the next post I’ll look at some of the issues that didn’t make in to this paper – in particular hypothesis based tests that focus on testing differences between model fits. One key observation (also suggested by Gelman) is that strict p-value cutoffs lead one to focus on obvious or well-known effects. For small-scale exploratory analyses such as described in this paper, a more relaxed threshold of 0.1 might be more suitable, allowing marginal effects that may, however, be biologically interesting to be considered.

Written by Rajarshi Guha

May 2nd, 2016 at 2:10 am

## Summarizing Collections of Curves

I was browsing live notes from the recent IEEE conference on visualization and came across a paper about functional boxplots. The idea is an extension of the boxplot visualization (shown alongside), to a set of functions. Intuitively, one can think of a functional box plot as specific envelopes for a set of functions. The construction of this plot is based on the notion of band depth (see the more general concept of data depth) which is a measure of how far a given function is from the collection of functions. As described in Sun & Genton the band depth for a given function can be computed by randomly selecting $$J$$ functions and identifying wether the given function is contained within the minimum and maximum of the $$J$$ functions. Repeating this multiple times, the fraction of times that the given function is fully contained within the $$J$$ random functions gives the band depth, $$BD_j$$. This is then used to order the functions, allowing one to compute a 50% band, analogous to the IQR in a traditional boxplot. There are more details (choice of $$J$$, partial bounding, etc.) described in the papers and links above.

My interest in this approach was piqued since one way of summarizing a dose response screen, or comparing dose response data across multiple conditions is to generate a box plot of a single curve fit parameter – say, $$\log IC_{50}$$. But if we wanted to consider the curves themselves, we have a few options. We could simply plot all of them, using translucency to avoid a blob. But this doesn’t scale visually. Another option, on the left, is to draw a series of box plots, one for each dose, and then optionally join the median of each boxplot giving a “median curve”. While these vary in their degree of utility, the idea of summarizing the distribution of a set of curves, and being able to compare these distributions is attractive. Functional box plots look like a way to do this. (A cool thing about functional boxplots is that they can be extended to multivariate functions such as surfaces and so on. See Mirzargar et al for examples)

Computing $$BD_j$$ can be time consuming if the number of curves is large or $$J$$ is large. Lopez-Pintado & Jornsten suggest a simple optimization to speed up this step, and for the special case of $$J = 2$$, Sun et al proposed a ranking based procedure that scales to thousands of curves. The latter is implemented in the fda package for R which also generates the final functional box plots.

As an example I considered 6 cell proliferation assays run in dose response, each one running the same set of compounds, but under different growth conditions. For each assay I only considered good quality curves (giving from 349 to 602 curves). The first plot compares the actives identified in the different growth conditions using the $$\log IC_{50}$$, and indicates a statistically significant increase in potency in the last three conditions compared to the first three.

In contrast, the functional box plots for the 6 assays, suggest a somewhat different picture (% Response = 100 corresponds to no cell kill and 0 corresponds to full cell kill).

The red dashed curves correspond to outliers and the blue lines correspond to the ‘maximum’ and ‘minimum’ curves (analogous to the whiskers of the traditional boxplot). Importantly, these are not measured curves, but instead correspond to the dose-wise maximum (and minimum) of the real curves. The pink region represents 50% of the curves and the black line represents the (virtual) median curve. In each case the X-axis corresponds to dose (unlabeled to save space). Personally, I think this visualization is a little cleaner than the dose-wise box plot shown above.

The mess of red lines in the plot 1 suggest an issue with the assay itself. While the other plots do show differences, it’s not clear what one can conclude from this. For example, in the plot for 4, the dip on the left hand side (i.e., low dose) could suggest that there is a degree of cytotoxicity, which is comparatively less in 3, 5 and 6. Interestingly none of the median curves are really sigmoidal, suggesting that the distribution of dose responses has substantial variance.

Written by Rajarshi Guha

November 30th, 2014 at 3:02 pm

## Ranking Dose Response Curves

UPDATE (3/21) – I was contacted by the author of the paper who pointed out that my analysis was based on a misunderstanding of the paper. Specifically

1. The primary goal of WES is to identify actives – and according to the authors definition, the most interesting actives (that should be ranked highly) are those that have no dose response and show a constant activity equal to the positive control. Next in importance are compounds that exhibit a dose response. Finally the least interesting (and so lowest ranked) are those that show no dose response and are flat at the negative control level.
2. The WES method requires that data be normalized such that DMSO (i.e., negative control) is at 0 and positive control is at 100%.

Since my analysis was based on the wrong normalization scheme the conclusions were erroneous. When the proper normalization is taken into account, the method works as advertised in that it correctly ranks compounds that show constant activity at the positive control level at the top, followed by curves with a dose response and finally with inactives (no activity at all) at the bottom.

Based on this I’ve updated the figures and text to correct my mistake. However, in my opinion, if the goal is to identify compounds that have a constant activity one does not need to go to entropy. In addition, for the case of compounds with a well defined dose response, the WES essentially ranks them by potency (assuming a valid curve fit). The updated text goes on to discuss these aspects.

UPDATE (2/25) – Regenerated the enrichment curves so that data was ranked in the correct order when LAC50 was being used.

I came across a paper that describes the use of weighted entropy to rank order dose response curves. As this data type is the bread and butter of my day job a simple ranking method is always of interest to me. While the method works as advertised, it appears to be a rather constrained method and doesn’t seem to do a whole let better than simpler, pre-existing approaches.

The paper correctly notes that there is no definitive protocol to rank compounds using their dose response curves. Such rankings are invariably problem dependent – in some cases, simple potency based ranking of good quality curves is sufficient. In other cases structural clustering combined with a measure of potency enrichment is more suitable. In addition, it is also true that all compounds in a screen do not necessarily fit well to a 4-parameter Hill model. This may simply be due to noise but could also be due to some process that is better fit by some other model (bell or U shaped curves). The point being that rankings based on a pre-defined model may not be useful or accurate.

The paper proposes the use of entropy as a way to rank dose response curves in a model-free manner. While a natural approach is to use Shannon entropy, the author suggests that the equal weighting implicit in the calculation is unsuitable. Instead, the use of weighted entropy (WES) is proposed as a more robust approach that takes into account unreliable data points. The author defines the weights based on the level of detection of the assay (though I’d argue that since the intended goal is to capture the reliability of individual response points, a more appropriate weight should be derived from some form of variance – either from replicate data or else pooled across the collection) . The author then suggests that curves should be ranked by the WES value, with higher values indicating a better rank.

For any proposed ranking scheme, one must first define what the goal is. When ranking dose response curves are we looking for compounds

• that exhibit well defined dose response (top and bottom asymptotes, > 80% efficacy etc)?
• good potency, even if the curve is not that well fit?
• compounds with a specific chemotype?

According to the paper, a key goal is to be able to identify compounds that show a constant activity – and within such compounds the more interesting ones are those that have constant activity = 100%. While I disagree that these are the most interesting compounds, it is not clear why one would need an entropy based method to identify such constant-activity curves (either at 100% or 0%).

More generally, for well defined dose response curves, the WES, by definition, tracks potency. This can be seen in the figure alongside that plots the WES value vs the log AC50 for a set of 27 good quality curves taken from a screen of 1408 AR agonists. Granted, when no model can be fit, one does not have an AC50, whereas a WES can be evaluated. But in such a case it’s not clear why one would necessarily want to quantify presumably noisy data.

However, going along with the authors definition, the method does distinguish valid dose responses from inactives (though again, one does not require entropy to make such a distinction!) as shown in the adjoining figure. It is clear from the definition of WES that a curve that is flat at 100% will exhibit the maximum value of WES and so will always rank high.

One way to to test the performance of ranking methods this is to take a collection of curves, rank them by a measure and identify how many actives are identified in the top N% of the collection, for varying N. Ideally, a good ranking would identify nearly all the actives for a small N. If the ranking were random one would identify N% of the actives in the top N% of the collection. Here an active is defined in terms of curve class, a heuristic that we use to initially weed out poor quality curves and focus on good quality ones. I defined active as curve classes 1.1, 1.2, 2.2 and 2.1 (see here for a summary of curve classes).

As pointed out by the author during our conversation, this is not an entirely fair comparison – since my scheme does not consider a flat curve at 100% as active. Though it’s a valid point, the dataset I worked with did not have any such curves. More generally, such curves would be the exception in a qHTS screen (assuming the concentration ranges have been correctly chosen). From that point of view, one should be able to apply WES to generate a ranking for any qHTS screen otherwise one would have to inspect the curves first to ensure that it contains such “flat actives” and then apply WES. Which is not the right way to go about it.

As shown in the enrichment plot shown alongside (generated for the 1408 compound AR agonist dataset), WES works better than random (and much better than the standard Shannon entropy), but is still outperformed by the area under the dose response curve (AUC) and potency. I certainly don’t claim that AUC is a completely robust way to rank dose response curves (in fact for some cases such as invalid curve fits, it’d be nonsensical). I also include LAC50, the logarithm of the AC50, as a ranking method simply because the paper considers it a poor way to rank curves (which I agree with, particularly if one does not first filter for good quality, efficacious curves).

There are a few other issues, though I think the most egregious one was that the method was tested on just one dataset. I’m not convinced that a single dataset represents a sufficient validation (given that Tox21 has about 80 published bioassays in PubChem). But that’s a case of poor reviewing rather than a technical flaw.

Written by Rajarshi Guha

July 23rd, 2014 at 1:52 pm