# So much to do, so little time

Trying to squeeze sense out of chemical data

## Ridit Analysis

While preparing material for an R workshop I ran at NCATS, I was pointed to the topic of ridit analysis. Since I hadn’t heard of this technique before, I decided to look into it and investigate how R could be used for such an analysis (and yes, there is a package for it).

## Why ridit analysis?

First, let’s consider why one might use ridit analysis. In many scenarios one has data that is categorical, but where the categories are ordered. This type of data is termed ordinal (sometimes also called “nominal with order”). An example might be a trial of an analgesic’s ability to reduce pain, whose outcome could be no pain, some pain, or extreme pain. While there are three categories, it’s clear that there is an ordering to them. Analysis of such data usually makes use of methods devised for categorical data – but such methods will not make use of the information contained in the ordering of the categories. Alternatively, one might numerically code the groups as 1, 2 and 3 and then apply methods devised for continuous or discrete variables. This is not appropriate, since one can change the results simply by changing the category coding.

Ridit analysis essentially transforms ordinal data to a probability scale (one could call it a virtual continuous scale). The term stands for relative to an identified distribution integral transformation and is analogous to probit or logit. (Importantly, ridit analysis is closely related to the Wilcoxon rank sum test. As shown by Selvin, the Wilcoxon test statistic and the mean ridit are directly related.)
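The Wilcoxon connection can be checked numerically. Below is a small sketch with an invented toy dataset (two groups of ordinal scores coded 1 to 3; all numbers are made up): the mean ridit of the comparison group, computed from the reference group’s distribution, equals the tie-corrected Mann–Whitney statistic divided by $n_1 n_2$.

```r
## Toy ordinal data, coded 1 (best) to 3 (worst); values are invented
ref  <- c(1, 1, 2, 2, 3)   # reference group
comp <- c(2, 3, 3, 3, 1)   # comparison group

## Ridit of each category, relative to the reference group:
## P(X_ref < x) + 0.5 * P(X_ref = x)
p.ref  <- as.numeric(table(factor(ref, levels = 1:3))) / length(ref)
ridits <- cumsum(p.ref) - 0.5 * p.ref

## Mean ridit of the comparison group
mean.ridit <- mean(ridits[comp])

## Tie-corrected Mann-Whitney statistic, scaled by n1 * n2
W <- unname(wilcox.test(comp, ref, exact = FALSE)$statistic)
all.equal(mean.ridit, W / (length(ref) * length(comp)))  # TRUE
```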

## Definitions

Essentially, one must have at least two groups, one of which is selected as the reference group. Then, for a non-reference group, the mean ridit is an estimate of the probability that a random individual from that group will have a value on the underlying (virtual) continuous scale greater than or equal to the value for a random individual from the reference group.

So if larger values of the underlying scale imply a worse condition, then the mean ridit is the estimated probability that a random individual from the group is worse off than a random individual from the reference group (based on the interpretation from Bross). Based on the definition of a ridit (see here or here), one can compute confidence intervals (CIs) or test the hypothesis that different groups have equal mean ridits. Let’s see how we can do that using R.

## Mechanics of ridit analysis

Consider a dataset taken from Donaldson (Eur. J. Pain, 1988), which looked at the effect of high and low levels of radiation treatment on trial participants’ sleep. The numbers are counts of patients:

```r
sleep <- data.frame(pain.level = factor(c('Slept all night with no pain',
                                          'Slept all night with some pain',
                                          'Woke with pain - medication provided relief',
                                          'Woke with pain - medication provided no relief',
                                          'Awake most or all of night with pain'),
                                        levels = c('Slept all night with no pain',
                                                   'Slept all night with some pain',
                                                   'Woke with pain - medication provided relief',
                                                   'Woke with pain - medication provided no relief',
                                                   'Awake most or all of night with pain')),
                    low.dose = c(3, 10, 6, 2, 1),
                    high.dose = c(6, 10, 2, 0, 0))
```

Here the groups are in the columns (low.dose and high.dose) and the categories are ordered such that Awake most or all of night with pain is the “maximum” category. To compute the mean ridit for each dose group we first reorder the table, convert the counts to proportions, and then compute the ridit for each category (i.e., row).

```r
## reorder table
sleep <- sleep[length(levels(sleep$pain.level)):1, ]

## compute proportions
sleep$low.dose.prop <- sleep$low.dose / sum(sleep$low.dose)
sleep$high.dose.prop <- sleep$high.dose / sum(sleep$high.dose)

## compute ridits
ridit <- function(props) {
  ## props should be in order of levels (highest to lowest)
  r <- rep(-1, length(props))
  for (i in 1:length(props)) {
    if (i == length(props))
      vals <- 0
    else
      vals <- props[(i+1):length(props)]
    r[i] <- sum(vals) + 0.5 * props[i]
  }
  return(r)
}
sleep$low.dose.ridit <- ridit(sleep$low.dose.prop)
sleep$high.dose.ridit <- ridit(sleep$high.dose.prop)
```

The resultant table is below:

```
                                      pain.level low.dose high.dose low.dose.prop high.dose.prop low.dose.ridit high.dose.ridit
5           Awake most or all of night with pain        1         0    0.04545455      0.0000000     0.97727273       1.0000000
4 Woke with pain - medication provided no relief        2         0    0.09090909      0.0000000     0.90909091       1.0000000
3    Woke with pain - medication provided relief        6         2    0.27272727      0.1111111     0.72727273       0.9444444
2                 Slept all night with some pain       10        10    0.45454545      0.5555556     0.36363636       0.6111111
1                   Slept all night with no pain        3         6    0.13636364      0.3333333     0.06818182       0.1666667
```

The last two columns contain the ridit values for each category, and can be interpreted as probability estimates that an individual’s value on the underlying continuous scale is less than or equal to the midpoint of the corresponding interval. The next step (and the main point of the analysis) is to compute the mean ridit for a group relative to a reference group: essentially the sum of the category proportions for that group, weighted by the category ridits of the reference group. In this case, let’s take the low dose group as the reference.

```r
mean.r.high <- sum(with(sleep, high.dose.prop * low.dose.ridit))
```

This comes out to 0.305, and can be interpreted as the probability that a patient receiving the high dose of radiation will experience more sleep interference than a patient in the low dose group.
Importantly, since ridits are estimates of probabilities, the complementary ridit (i.e., using the high dose group as the reference) comes out to 0.694, and is the probability that a patient in the low radiation dose group will experience more sleep interference than a patient in the high dose group.

## Statistics on ridits

There are a number of ways to compute CIs on mean ridits, or else test the hypothesis that the mean ridits differ between $k$ groups. Donaldson’s method for CI calculation appears to be restricted to two groups. In contrast, Fleiss et al suggest an alternate method. Using the latter, the CI for a group vs the reference group is given by

$\overline{r}_i \pm B \frac{\sqrt{n_s + n_i}}{2\sqrt{3 n_s n_i}}$

where $\overline{r}_i$ is the mean ridit for the $i$‘th group, $n_s$ and $n_i$ are the sizes of the reference and query groups, respectively, and $B$ is the multiple-testing-corrected critical value. If one uses the Bonferroni correction, it would be $1.96 \times 1$, since only two groups are being compared (and so there is just one comparison). Thus the CI for the mean ridit of the low dose group, using the high dose group as the reference, is given by

$0.694 \pm 1.96 \frac{\sqrt{18 + 22}}{2\sqrt{3 \times 18 \times 22}}$

which is 0.515 to 0.874. Given that the interval does not include 0.5, we can conclude that there is a statistically significant difference ($\alpha = 0.05$) in the mean ridits of the two groups. For the case of multiple groups, the CI for any group vs any other group (i.e., not considering the reference group) is given by

$(\overline{r}_i - \overline{r}_j + 0.5) \pm B \frac{\sqrt{n_i + n_j}}{2\sqrt{3 n_i n_j}}$

Fleiss et al also describe how one can test the hypothesis that the mean ridits across all groups (including the reference) are equal using a $\chi^2$ statistic, as well as how to perform the same test between any group and the reference group.
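To make the CI formula concrete, here is a small helper (the function name is my own, not from any package) that computes the Fleiss-style interval; plugging in the numbers from the sleep example reproduces the interval above.

```r
## CI for a mean ridit, following the Fleiss et al formula. n.ref and
## n.query are the reference and query group sizes; B is the (possibly
## Bonferroni-adjusted) critical value. Function name is my own.
ridit.ci <- function(mean.ridit, n.ref, n.query, B = 1.96) {
  half.width <- B * sqrt(n.ref + n.query) / (2 * sqrt(3 * n.ref * n.query))
  c(lower = mean.ridit - half.width, upper = mean.ridit + half.width)
}

## Low dose group vs the high dose reference: n_s = 18, n_i = 22
round(ridit.ci(0.6944, 18, 22), 3)  # c(lower = 0.515, upper = 0.874)
```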
## R Implementation

I’ve implemented a function that computes mean ridits and their 95% confidence intervals (the confidence level can be changed). It expects the data as counts for each category, with the input data.frame ordered in descending order of the categories. You need to specify the variable representing the categories and the reference variable. As an example of its usage, we use the dataset from Fleiss et al, which measured the degree of pain relief provided by different drugs after oral surgery. We perform a ridit analysis using aspirin as the reference group:

```r
dental <- data.frame(pain.relief = factor(c('Very good', 'Good', 'Fair', 'Poor', 'None'),
                                          levels = c('Very good', 'Good', 'Fair', 'Poor', 'None')),
                     ibuprofen.low = c(61, 17, 10, 6, 0),
                     ibuprofen.high = c(52, 25, 5, 3, 1),
                     Placebo = c(32, 37, 10, 18, 0),
                     Aspirin = c(47, 25, 11, 4, 1))
ridit(dental, 'pain.relief', 'Aspirin')
```

which gives us

```
$category.ridit
  pain.relief ibuprofen.low ibuprofen.high    Placebo
1   Very good    0.67553191    0.697674419 0.83505155
2        Good    0.26063830    0.250000000 0.47938144
3        Fair    0.11702128    0.075581395 0.23711340
4        Poor    0.03191489    0.029069767 0.09278351
5        None    0.00000000    0.005813953 0.00000000

$mean.ridit
 ibuprofen.low ibuprofen.high        Placebo
     0.5490812      0.5455206      0.3839620

$ci
           group       low      high
1  ibuprofen.low 0.4361128 0.6620496
2 ibuprofen.high 0.4300396 0.6610016
3        Placebo 0.2718415 0.4960826
```

The results suggest that patients receiving either dose of ibuprofen will get better pain relief than those receiving aspirin. However, if you consider the CIs, it’s clear that both contain 0.5 and thus there is no statistically significant difference in the mean ridits for these two doses compared to aspirin. On the other hand, placebo definitely leads to less pain relief compared to aspirin.

Written by Rajarshi Guha

January 18th, 2015 at 7:26 pm

## Summarizing Collections of Curves

I was browsing live notes from the recent IEEE conference on visualization and came across a paper about functional boxplots. The idea is an extension of the boxplot visualization (shown alongside) to a set of functions. Intuitively, one can think of a functional box plot as specific envelopes for a set of functions. The construction of this plot is based on the notion of band depth (see the more general concept of data depth), which measures how central a given function is within a collection of functions. As described in Sun & Genton, the band depth for a given function can be computed by randomly selecting $J$ functions and identifying whether the given function is contained within the minimum and maximum of the $J$ functions. Repeating this multiple times, the fraction of times that the given function is fully contained within the $J$ random functions gives the band depth, $BD_J$. This is then used to order the functions, allowing one to compute a 50% band, analogous to the IQR in a traditional boxplot. There are more details (choice of $J$, partial bounding, etc.) described in the papers and links above.
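The sampling definition of band depth can be sketched in a few lines of R. This is my own brute-force Monte Carlo version for $J = 2$ (not the fast ranking procedure of Sun et al), assuming the curves are stored as rows of a matrix sampled on a common grid:

```r
## Brute-force band depth for J = 2: the fraction of randomly chosen
## pairs of curves whose pointwise band fully contains a given curve.
## This follows the definition directly and is slow for large collections.
band.depth <- function(curves, n.samples = 500) {
  n <- nrow(curves)
  depth <- numeric(n)
  for (i in seq_len(n)) {
    contained <- 0
    for (s in seq_len(n.samples)) {
      pair <- sample(setdiff(seq_len(n), i), 2)
      lo <- pmin(curves[pair[1], ], curves[pair[2], ])
      hi <- pmax(curves[pair[1], ], curves[pair[2], ])
      if (all(curves[i, ] >= lo & curves[i, ] <= hi))
        contained <- contained + 1
    }
    depth[i] <- contained / n.samples
  }
  depth
}

## Toy usage: 20 vertically shifted sine curves. The most central shift
## should get the highest depth, the extreme shifts the lowest.
set.seed(1)
grid <- seq(0, pi, length.out = 25)
offsets <- rnorm(20, sd = 0.3)
curves <- t(sapply(offsets, function(a) sin(grid) + a))
d <- band.depth(curves)
```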

My interest in this approach was piqued since one way of summarizing a dose response screen, or comparing dose response data across multiple conditions is to generate a box plot of a single curve fit parameter – say, $\log IC_{50}$. But if we wanted to consider the curves themselves, we have a few options. We could simply plot all of them, using translucency to avoid a blob. But this doesn’t scale visually. Another option, on the left, is to draw a series of box plots, one for each dose, and then optionally join the median of each boxplot giving a “median curve”. While these vary in their degree of utility, the idea of summarizing the distribution of a set of curves, and being able to compare these distributions is attractive. Functional box plots look like a way to do this. (A cool thing about functional boxplots is that they can be extended to multivariate functions such as surfaces and so on. See Mirzargar et al for examples)

Computing $BD_j$ can be time consuming if the number of curves is large or $J$ is large. Lopez-Pintado & Jornsten suggest a simple optimization to speed up this step, and for the special case of $J = 2$, Sun et al proposed a ranking based procedure that scales to thousands of curves. The latter is implemented in the fda package for R which also generates the final functional box plots.

As an example I considered 6 cell proliferation assays run in dose response, each one running the same set of compounds, but under different growth conditions. For each assay I only considered good quality curves (giving from 349 to 602 curves). The first plot compares the actives identified in the different growth conditions using the $\log IC_{50}$, and indicates a statistically significant increase in potency in the last three conditions compared to the first three.

In contrast, the functional box plots for the 6 assays, suggest a somewhat different picture (% Response = 100 corresponds to no cell kill and 0 corresponds to full cell kill).

The red dashed curves correspond to outliers and the blue lines correspond to the ‘maximum’ and ‘minimum’ curves (analogous to the whiskers of the traditional boxplot). Importantly, these are not measured curves, but instead correspond to the dose-wise maximum (and minimum) of the real curves. The pink region represents 50% of the curves and the black line represents the (virtual) median curve. In each case the X-axis corresponds to dose (unlabeled to save space). Personally, I think this visualization is a little cleaner than the dose-wise box plot shown above.

The mess of red lines in plot 1 suggests an issue with the assay itself. While the other plots do show differences, it’s not clear what one can conclude from them. For example, in plot 4, the dip on the left hand side (i.e., at low doses) could suggest a degree of cytotoxicity, which is comparatively less in plots 3, 5 and 6. Interestingly, none of the median curves are really sigmoidal, suggesting that the distribution of dose responses has substantial variance.

Written by Rajarshi Guha

November 30th, 2014 at 3:02 pm

## fingerprint 3.5.2 released

Comparison of nested loop performance in R and C for Tanimoto similarity matrix calculation.

Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.

His patch led to a 10-fold improvement in running time. However, his code used nested for loops in R. This is a well known bottleneck, and most idiomatic R code replaces for loops with a member of the sapply/lapply/tapply family. In this case, however, it was easier to write a small piece of C code to perform the loops, resulting in a 4- to 6-fold improvement over Abhik’s observed running times (see the figure summarizing the Tanimoto similarity matrix calculation for 1024-bit fingerprints, with 256 bits randomly selected to be 1). As always, the latest code is available on GitHub.
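As an aside, for fingerprints held as rows of a 0/1 matrix the same arithmetic can also be expressed as a single matrix product, which avoids explicit loops entirely. This is just a sketch of the algebra, not the fingerprint package’s actual implementation:

```r
## Tanimoto similarity matrix via matrix algebra: for binary rows, the
## pairwise intersection counts are M %*% t(M), and the union counts
## follow from the row sums. Sketch only; not the package's C code.
tanimoto.matrix <- function(M) {
  common <- M %*% t(M)                        # pairwise intersections (c)
  n.on   <- rowSums(M)                        # bits on per fingerprint (a, b)
  common / (outer(n.on, n.on, `+`) - common)  # c / (a + b - c)
}

## 10 random 1024-bit fingerprints with roughly 256 bits set
set.seed(42)
fps <- matrix(rbinom(10 * 1024, 1, 256 / 1024), nrow = 10)
S <- tanimoto.matrix(fps)  # 10 x 10, symmetric, unit diagonal
```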

Written by Rajarshi Guha

October 27th, 2013 at 10:44 pm

Posted in cheminformatics,software


## Updated version of rcdk (3.2.3)

I’ve pushed updates to the rcdklibs and rcdk packages that support cheminformatics in R using the CDK. The new versions employ the latest CDK master which, as Egon pointed out, has significantly fewer bugs and, thanks to Jon, improved performance. New additions to the package include support for the LINGO and Signature fingerprinters (you’ll need the latest version of fingerprint).

Written by Rajarshi Guha

October 6th, 2013 at 11:17 pm

Posted in cheminformatics,software


## Life and death in a screening campaign

So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing The R Book and came across survival models. Such models are used to characterize the time to an event, where the event could be the death of a patient or the failure of a part, and so on. In these models the dependent variable is the number of time units that pass until the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.

It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How do molecular weight, lipophilicity, etc. affect a compound’s “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns in which related assays are grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays are more or less indicative of a compound’s survival in a campaign. Survival models allow us to address these questions.

How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.

Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve describes the proportion of individuals alive at each time point. There are three types of survivorship curves: Type I, II and III, corresponding to scenarios where individuals face a higher risk of death at later times, a constant risk of death over time, or a higher risk of death at earlier times, respectively.

For the case of a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There are a variety of functions to choose from, which either require that you know the error distribution or that you are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model. I won’t go into the details of either approach, save to note that a Cox model does not allow us to make predictions beyond the last time point, whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last time point (i.e., the last assay), but are more interested in knowing what factors might affect the survival of compounds through the assay sequence. So a Cox model should be sufficient. The survival package provides the necessary methods in R.
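As a rough sketch of what such a fit might look like, here is a Cox model on entirely simulated data (the survival package ships with R; the descriptors, effect sizes and the geometric “death” process are all invented for illustration):

```r
library(survival)

set.seed(123)
n <- 5000
mw   <- rnorm(n, 400, 75)   # hypothetical molecular weight
logp <- rnorm(n, 3, 1.5)    # hypothetical lipophilicity

## Simulate the number of assays (1 to 6) each compound survives, with
## higher logP increasing the per-assay chance of being dropped
p.drop <- pmin(0.9, 0.5 * exp(0.3 * as.numeric(scale(logp))))
time   <- pmin(1 + rgeom(n, prob = p.drop), 6)
status <- as.integer(time < 6)  # compounds reaching the last assay are censored

## Cox proportional hazards fit; by construction the logp coefficient
## should come out positive (higher hazard of being dropped)
fit <- coxph(Surv(time, status) ~ mw + logp)
coef(fit)
```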

OK – it sounds cute, but it has some obvious limitations:

1. The use of a survival model assumes a linear time line. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
2. A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules is selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others, or some notion of “interesting” combined with a hunch that it will work out. Essentially, chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that the model may capture the underlying heuristics via proxy variables (i.e., the descriptors), and that examination of those variables might provide some insight into the heuristics being employed.
3. Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, the decrease in death rate is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2000 will be cherry picked for confirmation, and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.

The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.

Written by Rajarshi Guha

October 2nd, 2013 at 10:22 pm