## Archive for the ‘statistics’ tag

## Life and death in a screening campaign

So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing The R Book and came across survival models. Such models are used to characterize time to events, where an event could be the death of a patient, the failure of a part and so on. In these types of models the dependent variable is the number of time units that pass until the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.

It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. *Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How do molecular weight, lipophilicity etc. affect a compound’s “survival”?* One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays can be more or less indicative of a compound’s survival in a campaign. Survival models allow us to address these questions.

How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear, however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.

Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve characterizes the proportion of individuals alive at a certain time point. There are three types of survivorship curves: I, II and III corresponding to scenarios where individuals have a higher risk of death at later times, a constant risk of death and individuals have a higher risk of death at earlier times, respectively.

For the case of a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general, they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There are a variety of functions to choose from, but you either need to know the error distribution or be willing to use trial and error. The alternative is to use a non-parametric (strictly speaking, semi-parametric) approach, the most common being the Cox proportional hazards model. I won’t go into the details of either approach, save to note that using a Cox model does not allow us to make predictions beyond the last time point whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last time point (i.e., the last assay) but are more interested in knowing what factors might affect survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.
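As a rough illustration of what such a fit might look like, the sketch below simulates a small campaign and fits a Cox model with `coxph()` from the survival package. Everything about the data is made up: the descriptor names (`mw`, `logp`), the five-assay sequence, the dropout mechanism and the decision to treat compounds that reach the last assay as censored are all assumptions for the sake of the example, not results from a real screen.

```r
library(survival)

set.seed(1)
n <- 500

# Hypothetical descriptors; not taken from any real campaign
cmpds <- data.frame(
  mw   = rnorm(n, mean = 400, sd = 60),
  logp = rnorm(n, mean = 3, sd = 1)
)

# Simulated "survival time": the number of assays a compound is tested in (1..5).
# Heavier compounds are made to drop out a little earlier, purely for illustration.
p_drop <- plogis(-0.5 + 0.01 * (cmpds$mw - 400))
cmpds$n_assays <- pmin(rgeom(n, prob = p_drop) + 1, 5)

# Assumption: compounds that reach the last assay are censored (status = 0),
# everything discarded earlier counts as an event (status = 1)
cmpds$status <- as.integer(cmpds$n_assays < 5)

fit <- coxph(Surv(n_assays, status) ~ mw + logp, data = cmpds)
print(summary(fit))
```

With only a handful of distinct time points there are many tied event times; `coxph()` handles these with its default tie approximation, but that is one more thing to keep an eye on with real data.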

OK – it sounds cute, but it has some obvious limitations:

- The use of a survival model assumes a linear time line. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
- A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules is selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others, or there is some notion of “interesting” combined with a hunch that it will work out. Essentially chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that it is possible the model captures the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
- Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, the drop in the number of survivors is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2000 will be cherry picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.
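To put numbers on the last point, the rough counts quoted above can be turned into a survivorship curve with a couple of lines of R. This is just arithmetic on those ballpark figures; the stage names are mine and 10 is used as a stand-in for “fewer than 10”.

```r
# Compound counts per stage, taken from the rough figures quoted above;
# 10 stands in for "fewer than 10" at the ADMET / in vivo stage
n_tested <- c(primary = 400000, confirmation = 2000,
              secondary = 50, admet_in_vivo = 10)

# Proportion of the original library still "alive" at each stage
survivorship <- n_tested / n_tested[["primary"]]

# Fraction of compounds lost between consecutive stages
# (each value is named by the stage being entered)
frac_lost <- 1 - n_tested[-1] / n_tested[-length(n_tested)]

print(signif(survivorship, 3))
print(round(frac_lost, 3))
```

The per-step loss does fall with time, as a Type III curve implies, but by the later stages there are only a few dozen compounds left to generate any events at all.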

The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.

## When is a Bad Plate Bad?

When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be very good. That is, the data will be of poor quality for a variety of reasons. Usually we can evaluate various statistical quality metrics to assess which plates are good and which ones need to be redone. A common metric is the Z-factor, which uses the positive and negative control wells. The problem is that if one or two wells have a problem (say, no signal in the negative control) then the Z-factor will be very poor. Yet, the plate could be used if we just mask those bad wells.

Now, for our current screens (100 plates) manual inspection is boring but doable. As we move to genome-wide screens we need a better way to identify truly bad plates from plates that could be used. One approach is to move to other metrics – SSMD (defined here and applications to quality control discussed here) is regarded as more effective than Z-factor – and in fact it’s advisable to look at multiple metrics rather than depend on any single one.

An alternative trick is to compare the Z-factor for a given plate to the *trimmed* Z-factor, which is evaluated using the trimmed means and standard deviations of the control wells. In our setup we trim 10% of the positive and negative control wells. For a plate that appears to be poor due to one or two bad control wells, the trimmed Z-factor should be significantly higher than the original Z-factor. But for a plate in which, say, the negative control wells all show poor signal, there should not be much of a difference between the two values. The analysis can be rapidly performed using a plot of the two values, as shown below. Given such a plot, we’d probably flag for a redo only those plates whose trimmed Z-factor is less than 0.5 and that lie close to the diagonal. (Though for RNAi screens, Z’ = 0.5 might be too stringent.)

From the figure below, just looking at the Z-factor would have suggested 4 or 5 plates to redo. But when compared to the trimmed Z-factor, this comes down to a single plate. Of course, we’d look at other statistics as well, but it is a quick way to identify plates with truly poor Z-factors.
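The comparison is easy to sketch in R. The version below uses the usual definition of the Z-factor from control wells, 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|, and assumes that “trimming 10%” means dropping 10% of the wells from each end of the sorted values (it could equally mean 10% in total). The helper names and the simulated control wells are hypothetical and not our actual pipeline.

```r
# Z-factor computed from positive and negative control wells
zfactor <- function(pos, neg) {
  1 - 3 * (sd(pos) + sd(neg)) / abs(mean(pos) - mean(neg))
}

# Drop a fraction of the values from each end of the sorted vector
# (assumption: the trim is applied per tail)
trim_wells <- function(x, frac = 0.1) {
  k <- floor(length(x) * frac)
  x <- sort(x)
  if (k > 0) x[(k + 1):(length(x) - k)] else x
}

trimmed_zfactor <- function(pos, neg, frac = 0.1) {
  zfactor(trim_wells(pos, frac), trim_wells(neg, frac))
}

# Simulated controls for one plate, with a single failed negative-control well
set.seed(42)
pos <- rnorm(16, mean = 100, sd = 5)
neg <- rnorm(16, mean = 10, sd = 3)
neg_bad <- neg
neg_bad[1] <- 100

z_raw  <- zfactor(pos, neg_bad)
z_trim <- trimmed_zfactor(pos, neg_bad)
print(c(raw = z_raw, trimmed = z_trim))
```

Applied per plate, the two values can then be plotted against each other; plates that sit well above the diagonal are the ones rescued by trimming.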

## Correlating Continuous and Categorical Variables

At work, a colleague gave an interesting presentation on characterizing associations between continuous and categorical variables. I expect that I will be facing this issue in some upcoming work so was doing a little reading and made some notes for myself.

Given a continuous variable Y and a categorical variable G, is the distribution of Y independent of the levels of G? One can address this using parametric or non-parametric methods. Due to the superior power of parametric methods we start by considering them and only if we completely fail to satisfy the distributional assumptions of such methods, do we move to the non-parametric methods.

A simple approach is to convert G to an indicator variable matrix, such that each level of G becomes a binary vector, and then use this matrix and Y to fit a linear model. One can then perform an ANOVA on the fit to check whether Y is independent of the groups. This approach assumes that Y is normally distributed, and that variances are equal amongst the groups.
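To make the indicator-matrix idea concrete, here is a tiny sketch using `model.matrix()` on a made-up factor (the five values are arbitrary). When a factor is passed to `lm()`, R builds a matrix like this internally, except that by default it uses an intercept plus one column per non-reference level rather than one column per level.

```r
# A toy categorical variable with four levels
g <- factor(c('A', 'T', 'C', 'G', 'A'))

# One binary column per level (the "- 1" drops the intercept)
X_full <- model.matrix(~ g - 1)
print(X_full)

# The default coding used by lm(): intercept + (levels - 1) dummy columns
X_default <- model.matrix(~ g)
print(colnames(X_default))
```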

A simple example in R is as follows. We create three datasets: one in which there is no difference in the mean or variance across the groups, one in which the means differ and one in which the variances differ.

```r
G <- factor(sample(c('A','T','C','G'), 100, TRUE, c(0.2,0.2,0.2,0.2)))

Y1 <- rnorm(100, mean=10, sd=3)

Y2 <- Y1
Y2[which(G=='G')] <- sample(rnorm(length(which(G=='G')),mean=4,sd=3))

Y3 <- Y1
Y3[which(G=='G')] <- sample(rnorm(length(which(G=='G')),mean=10,sd=6))
```

A summary of the Y’s grouped by G is given below.
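One way to produce such a summary is with `aggregate()`. The sketch below regenerates the three datasets with an arbitrary seed so that it runs on its own, which means the numbers will not match the ones reported in this post.

```r
# Regenerate the three datasets (the seed is arbitrary, chosen only for reproducibility)
set.seed(123)
G  <- factor(sample(c('A', 'T', 'C', 'G'), 100, TRUE))
Y1 <- rnorm(100, mean = 10, sd = 3)
is_g <- G == 'G'
Y2 <- Y1
Y2[is_g] <- rnorm(sum(is_g), mean = 4, sd = 3)
Y3 <- Y1
Y3[is_g] <- rnorm(sum(is_g), mean = 10, sd = 6)
d <- data.frame(G, Y1, Y2, Y3)

# Per-group means and standard deviations of each response
grp_mean <- aggregate(cbind(Y1, Y2, Y3) ~ G, data = d, FUN = mean)
grp_sd   <- aggregate(cbind(Y1, Y2, Y3) ~ G, data = d, FUN = sd)
print(grp_mean)
print(grp_sd)
```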

We now run the ANOVA, using the first dataset.

```r
fit <- lm(Y1 ~ G, data=data.frame(Y1,G))
anova(fit)
```

Running this gives us

```
Analysis of Variance Table

Response: Y1
          Df Sum Sq Mean Sq F value Pr(>F)
G          3   0.74    0.25  0.0304 0.9928
Residuals 96 777.19    8.10
```

If we consider the 0.05 significance level, we would not reject the null hypothesis that Y is independent of G. If we now rerun the ANOVA using the second dataset

```r
fit <- lm(Y2 ~ G, data=data.frame(Y2,G))
anova(fit)
```

we get

```
Analysis of Variance Table

Response: Y2
          Df  Sum Sq Mean Sq F value    Pr(>F)
G          3 1131.65  377.22  49.321 < 2.2e-16 ***
Residuals 96  734.22    7.65
```

which suggests that we reject the null hypothesis and accept the alternative that the Y’s are not independent of the groups (which is true by design).

Furthermore, the ANOVA really only indicates whether there is a dependence between the groups and the means of the group members – it doesn’t tell us anything about lack of independence in terms of scale parameters. This is not surprising since the linear model is defined to have a constant variance. Thus, if we want to investigate dependence of the variance on groups, we should look at distributions of residuals or other approaches to identifying heteroscedasticity (such as Levene’s test).
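As a sketch of that last suggestion: Levene’s test amounts to an ANOVA on the absolute deviations of each observation from its group centre, so it can be written in a few lines of base R. The version below centres on the group median (the Brown-Forsythe variant) and regenerates a dataset like Y3 with an arbitrary seed, so the numbers are illustrative only.

```r
# Data in the style of Y3: same mean everywhere, larger spread in group 'G'
set.seed(7)
G  <- factor(sample(c('A', 'T', 'C', 'G'), 100, TRUE))
Y3 <- rnorm(100, mean = 10, sd = 3)
is_g <- G == 'G'
Y3[is_g] <- rnorm(sum(is_g), mean = 10, sd = 6)

# Absolute deviation of each observation from its own group median
abs_dev <- abs(Y3 - ave(Y3, G, FUN = median))

# ANOVA on the deviations: a small p-value suggests the spread differs by group
levene_tab <- anova(lm(abs_dev ~ G))
print(levene_tab)
levene_p <- levene_tab[["Pr(>F)"]][1]
```

A packaged implementation is also available elsewhere (the car package has one, for instance), but the manual form makes it clear what is being tested.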

In case we cannot satisfy the assumptions required for ANOVA (even after investigating possible transformations of the data), we can move to non-parametric methods, which are easy to run but have lower power. For this problem, the Kruskal-Wallis test and the Fligner-Killeen test come in handy. These methods are in the base **stats** package but also come with the coin package, which is handy for running a variety of independence tests.

The Kruskal-Wallis test makes no assumptions about normality, but does assume that the distributions amongst the levels of G are of the same shape and have the same variance (and so, the differences are due to differences in the means). The Fligner-Killeen test also avoids the normality assumption but, in this case, has the null hypothesis that the variances in the groups are the same. Thus ideally, we’d run the Kruskal-Wallis test first and, if we could not reject the null hypothesis, we would then investigate whether there is a difference in the variances across the groups using the Fligner-Killeen test.

Some examples highlight the analysis. First, we consider the dataset that is independent of the groups (Y1):

```
> kruskal_test(Y1 ~ G, data=data.frame(Y1,G))

	Asymptotic Kruskal-Wallis Test

data:  Y1 by G (A, C, G, T)
chi-squared = 0.0628, df = 3, p-value = 0.9959
```

which indicates that we fail to reject the null hypothesis that the distribution of Y1 is independent of the levels of G. We then consider the dataset where the means are dependent on the groups

```
> kruskal_test(Y2 ~ G, data=data.frame(Y2,G))

	Asymptotic Kruskal-Wallis Test

data:  Y2 by G (A, C, G, T)
chi-squared = 57.3197, df = 3, p-value = 2.196e-12
```

which indicates that we can reject the null hypothesis. However, note that for Y3, where the means are independent of the groups but the variances differ, the Kruskal-Wallis test is not useful:

```
> kruskal_test(Y3 ~ G, data=data.frame(Y3,G))

	Asymptotic Kruskal-Wallis Test

data:  Y3 by G (A, C, G, T)
chi-squared = 1.5399, df = 3, p-value = 0.673
```

On the other hand, if we run the Fligner-Killeen test on these three datasets, we correctly identify Y3 as having its variance dependent on G

```
> fligner_test(Y1 ~ G, data=data.frame(Y1,G))

	Asymptotic Fligner-Killeen Test

data:  Y1 by G (A, C, G, T)
chi-squared = 2.9914, df = 3, p-value = 0.3929

> fligner_test(Y2 ~ G, data=data.frame(Y2,G))

	Asymptotic Fligner-Killeen Test

data:  Y2 by G (A, C, G, T)
chi-squared = 2.8202, df = 3, p-value = 0.4202

> fligner_test(Y3 ~ G, data=data.frame(Y3,G))

	Asymptotic Fligner-Killeen Test

data:  Y3 by G (A, C, G, T)
chi-squared = 15.3493, df = 3, p-value = 0.001541
```
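For completeness, the same two-step workflow can be run without coin, using `kruskal.test()` and `fligner.test()` from the base stats package. The sketch below regenerates the data with an arbitrary seed and, unlike the example above, uses balanced groups of 25 so that the group sizes are fixed; the statistics will therefore differ from those shown above.

```r
# Balanced version of the toy data (an assumption made here to fix group sizes)
set.seed(2024)
G  <- factor(rep(c('A', 'C', 'G', 'T'), each = 25))
Y1 <- rnorm(100, mean = 10, sd = 3)
Y2 <- Y1
Y2[G == 'G'] <- rnorm(25, mean = 4, sd = 3)    # shifted mean in group 'G'
Y3 <- Y1
Y3[G == 'G'] <- rnorm(25, mean = 10, sd = 6)   # inflated spread in group 'G'
d <- data.frame(G, Y1, Y2, Y3)

# Step 1: does the location of Y depend on G?
kw2 <- kruskal.test(Y2 ~ G, data = d)
kw3 <- kruskal.test(Y3 ~ G, data = d)

# Step 2: if not, does the spread depend on G?
fk3 <- fligner.test(Y3 ~ G, data = d)

print(kw2)
print(kw3)
print(fk3)
```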