# So much to do, so little time

Trying to squeeze sense out of chemical data

## Applications Invited for CSA Trust Grant for 2015

The Chemical Structure Association (CSA) Trust is an internationally recognized organization established to promote the critical importance of chemical information to advances in chemical research. In support of its charter, the Trust has created a unique Grant Program and is now inviting the submission of grant applications for 2015.

Purpose of the Grants
The Grant Program has been created to provide funding for the career development of young researchers who have demonstrated excellence in their education, research or development activities that are related to the systems and methods used to store, process and retrieve information about chemical structures, reactions and compounds. One or more Grants will be awarded annually up to a total combined maximum of ten thousand U.S. dollars ($10,000). Grants are awarded for specific purposes, and within one year each grantee is required to submit a brief written report detailing how the grant funds were allocated. Grantees are also requested to recognize the support of the Trust in any paper or presentation that is given as a result of that support. Who is Eligible? Applicant(s), age 35 or younger, who have demonstrated excellence in their chemical information related research and who are developing careers that have the potential to have a positive impact on the utility of chemical information relevant to chemical structures, reactions and compounds, are invited to submit applications. While the primary focus of the Grant Program is the career development of young researchers, additional bursaries may be made available at the discretion of the Trust. All requests must follow the application procedures noted below and will be weighed against the same criteria. Which Activities are Eligible? Grants may be awarded to acquire the experience and education necessary to support research activities; e.g. for travel to collaborate with research groups, to attend a conference relevant to one’s area of research, to gain access to special computational facilities, or to acquire unique research techniques in support of one’s research. Application Requirements: Applications must include the following documentation: 1. A letter that details the work upon which the Grant application is to be evaluated as well as details on research recently completed by the applicant; 2. The amount of Grant funds being requested and the details regarding the purpose for which the Grant will be used (e.g. cost of equipment, travel expenses if the request is for financial support of meeting attendance, etc.). The relevance of the above-stated purpose to the Trust’s objectives and the clarity of this statement are essential in the evaluation of the application); 3. A brief biographical sketch, including a statement of academic qualifications; 4. Two reference letters in support of the application. Additional materials may be supplied at the discretion of the applicant only if relevant to the application and if such materials provide information not already included in items 1-4. Three copies of the complete application document must be supplied for distribution to the Grants Committee. Deadline for Applications Applications for the 2015 Grant is March 13, 2015. Successful applicants will be notified no later than May 2nd of the relevant year. Address for Submission of Applications The application documentation should be forwarded to: Bonnie Lawlor, CSA Trust Grant Committee Chair, 276 Upper Gulph Road, Radnor, PA 19087, USA. If you wish to enter your application by e-mail, please contact Bonnie Lawlor at chescot@aol.com prior to submission so that she can contact you if the e-mail does not arrive. Written by Rajarshi Guha February 2nd, 2015 at 5:20 pm Posted in cheminformatics Tagged with , , ## Ridit Analysis without comments While preparing material for an R workshop I ran at NCATS I was pointed to the topic of ridit analysis. Since I hadn’t heard of this technique before I decided to look into it and investigate how R could be used for such an analysis (and yes, there is a package for it). ## Why ridit analysis? First, lets consider why one might consider ridit analysis. In many scenarios one might have data that is categorical, but the categories are ordered. This type of data is termed ordinal (sometimes also called “nominal with order”). An example might be a trial of an analgesics ability to reduce pain, whose outcome could be no pain, some pain, extreme pain. While there are three categories, it’s clear that there is an ordering to them. Analysis of such data usually makes use of methods devised for categorical data – but such methods will not make use of the information contained within the ordering of the categories. Alternatively, one might numerically code the groups using 1, 2 and 3 and then apply methods devised for continuous or discrete variables. This is not appropriate since one can change the results by simply changing the category coding. Ridit analysis essentially transforms ordinal data to a probability scale (one could call it a virtual continuous scale). The term actually stands for relative to an identified distribution integral transformation and is analoguous to probit or logit. (Importantly, ridit analysis is closely related to the Wilcoxon rank sum test. As shown by Selvin, the Wilcoxon test statistic and the mean ridit are directly related). ## Definitions Essentially, one must have at least two groups, one of which is selected as the reference group. Then for the non-reference group, the mean ridit is an estimate of the probability that a random individual from that group will have a value on the underlying (virtual) continuous scale greater than or equal to the value for a random individual from the reference group. So if larger values of the underlying scale imply a worse condition, then the mean ridit is the probability estimate that the random individual from the group is worse of than a random individual from the reference group (based on the interpretation from Bross). Based on the definition of a ridit (see here or here), one can compute confidence intervals (CI) or test the hypothesis that different groups have equal mean ridits. Lets see how we can do that using R ## Mechanics of ridit analysis Consider a dataset taken from Donaldson, (Eur. J. Pain, 1988) which looked at the effect of high and low levels of radiation treatment on trials participants’s sleep. The numbers are counts of patients:  123456789101112 sleep <- data.frame(pain.level=factor(c('Slept all night with no pain', 'Slept all night with some pain', 'Woke with pain - medication provided relief', 'Woke with pain - medication provided no relief', 'Awake most or all of night with pain'), levels=c('Slept all night with no pain', 'Slept all night with some pain', 'Woke with pain - medication provided relief', 'Woke with pain - medication provided no relief', 'Awake most or all of night with pain')), low.dose=c(3, 10, 6, 2, 1), high.dose=c(6,10,2,0,0)) Here the groups are in the columns (low.dose and high.dose) and the categories are ordered such tat Awake most or all of night with pain is the “maximum” category. To compute the mean ridits for each dose group we first reorder the table and then convert the counts to proportions and then compute ridits for each category (i.e., row).  1234567891011121314151617 ## reorder table sleep <- sleep[ length(levels(sleep$pain.level)):1, ] ## compute proportions sleep$low.dose.prop <- sleep$low.dose / sum(sleep$low.dose) sleep$high.dose.prop <- sleep$high.dose / sum(sleep$high.dose) ## compute riddit ridit <- function(props) { ## props should be in order of levels (highest to lowest)   r <- rep(-1, length(props))   for (i in 1:length(props)) {     if (i == length(props)) vals <- 0     else vals <- props[(i+1):length(props)]     r[i] <- sum(vals) + 0.5*props[i]   }   return(r) } sleep$low.dose.ridit <- ridit(sleep$low.dose.prop) sleep$high.dose.ridit <- ridit(sleep$high.dose.prop)

The resultant table is below

 123456 pain.level low.dose high.dose low.dose.prop high.dose.prop low.dose.ridit high.dose.ridit 5           Awake most or all of night with pain        1         0    0.04545455      0.0000000     0.97727273       1.0000000 4 Woke with pain - medication provided no relief        2         0    0.09090909      0.0000000     0.90909091       1.0000000 3    Woke with pain - medication provided relief        6         2    0.27272727      0.1111111     0.72727273       0.9444444 2                 Slept all night with some pain       10        10    0.45454545      0.5555556     0.36363636       0.6111111 1                   Slept all night with no pain        3         6    0.13636364      0.3333333     0.06818182       0.1666667

The last two columns represent the ridit values for each category and can be interpreted as

a probability estimate that an individuals value on the underlying continuous scale is less than or equal to the midpoint of the corresponding interval

The next step (and main point of the analysis) is to compute the mean ridit for a group (essentially the sum of the category proportions for that group weighted by the category ridits in the reference group) , based on a reference. In this case, lets assume the low dose group is the reference.

 1 mean.r.high <- sum(with(sleep, high.dose.prop * low.dose.ridit))

which is 0.305, and can be interpreted as the probability that a patient receiving the high dose of radiation will experience more sleep interference than a patient in the low dose group. Importantly, since ridits are estimates of probabilities, the complementary ridit (i.e., using the high dose group as reference) comes out to 0.694 and is the probability that a patient in the low radiation dose group will experience more sleep interferance than a patient in the high dose group.

## Statistics on ridits

There are a number of ways to compute CI’s on mean ridits or else test the hypothesis that the mean ridits differ between $$k$$ groups. Donaldsons method for CI calculation appears to be restricted to two groups. In contrast, Fleiss et al suggest an alternate method based on. Considering the latter, the CI for a group vs the reference group is given by

$$\overline{r}_i \pm B \frac{\sqrt{n_s +n_i}}{2\sqrt{3 n_s n_i}}$$

where $$\overline{r}_i$$ is the mean ridit for the $$i$$’th group, $$n_s$$ and $$n_i$$ are the sizes of the reference and query groups, respectively and $$B$$ is the multiple testing corrected standard error. If one uses the Bonferroni correction, it would be $$1.96 \times 1$$ since there is only two groups being compared (and so 1 comparison). Thus the CI for the mean ridit for the low dose group, using the high dose as reference is given by

$$0.694 \pm 1.96 \frac{\sqrt{18 + 22}}{2\sqrt{3 \times 18 \times 22}}$$

which is 0.515 to 0.873. Given that the interval does not include 0.5, we can conclude that there is a statistically significant difference ($$\alpha = 0.05$$) in the mean ridits between the two groups. For the case of multiple groups, the CI for any group vs any other group (i.e., not considering the reference group) is given by

$$(\overline{r}_i – \overline{r}_j + 0.5) \pm B \frac{\sqrt{n_i + n_j}}{2\sqrt{3 n_i n_j}}$$

Fleiss et al also describes how one can test the hypothesis that the mean ridits across all groups (including the reference) are equal using a $$\chi^2$$ statistic. In addition, they also describe how one can perform the same test between any group and the reference group.

## R Implementation

I’ve implemented a function that computes mean ridits and their 95% confidence interval (which can be changed). It expects that the data is provided as counts for each category and that the input data.frame is ordered in descending order of the categories. You need to specify the variable representing the categories and the reference variable. As an example of its usage, we use the dataset from Fleiss et al which measured the degree of pain relief provided by different drugs after oral surgery. We perfrom a ridit analysis using aspirin as the reference group:

 12345678 dental <- data.frame(pain.relief = factor(c('Very good', 'Good', 'Fair', 'Poor', 'None'),                        levels=c('Very good', 'Good', 'Fair', 'Poor', 'None')),                      ibuprofen.low = c(61, 17, 10, 6, 0),                      ibuprofen.high = c(52, 25, 5, 3, 1),                      Placebo = c(32, 37, 10, 18, 0),                                          Aspirin = c(47, 25, 11, 4, 1)                      ) ridit(dental, 'pain.relief', 'Aspirin')

which gives us

 1234567891011121314151617 $category.ridit pain.relief ibuprofen.low ibuprofen.high Placebo 1 Very good 0.67553191 0.697674419 0.83505155 2 Good 0.26063830 0.250000000 0.47938144 3 Fair 0.11702128 0.075581395 0.23711340 4 Poor 0.03191489 0.029069767 0.09278351 5 None 0.00000000 0.005813953 0.00000000$mean.ridit  ibuprofen.low ibuprofen.high        Placebo      0.5490812      0.5455206      0.3839620 $ci group low high 1 ibuprofen.low 0.4361128 0.6620496 2 ibuprofen.high 0.4300396 0.6610016 3 Placebo 0.2718415 0.4960826 The results suggest that patients receiving either dose of ibuprofen will get better pain relief compared to aspirin. However, if you consider the CI’s it’s clear that they both contain 0.5 and thus there is no statistical difference in the mean ridits for these two doses, compared to aspirin. On the other hand, placebo definitely leads to less pain relief compared to aspirin. Written by Rajarshi Guha January 18th, 2015 at 7:26 pm ## Thoughts on the DREAM Synergy Prediction Challenge without comments The DREAM consortium has run a number of predictive modeling challenges and the latest one on predicting small molecule synergies has just been published. The dataset that was provided included baseline gene expression of the cell line (OCI-LY3), expression in presence of compound (2 concentrations, 2 time points), dose response data for 14 compounds and the excess over Bliss for the 91 pairs formed from the 14 compounds. Based on this data (and available literature data) participants had to predict a ranking for the 91 combinations. The paper reports the results of 31 approaches (plus one method that was not compared to the others) and does a good job of summarizing their performance and identifying whether certain data type or certain approaches work better than others. They also investigated the performance of an ensemble of approaches, which, as one might expect, worked better than the single methods. While the importance of gene expression in predictive performance was not as great as I would’ve thought, it was certainly more useful than chemical structure alone. Interestingly, they also noted that “compounds with more targeted mechanisms, such as rapamycin and blebbistatin, were least synergistic“. I suspect that this is somewhat dataset specific, but it will be interesting to see whether this holds in large collections of combination experiment such as those run at NCATS. Overall, it’s an important contribution with the key take home message being … synergy and antagonism are highly context specific and are thus not universal properties of the compounds’ chemical, structural or substrate information. As a result, predictive methods that account for the genetics and regulatory architecture of the context will become increasingly relevant to generalize results across multiple contexts Given the relative dearth of predictive models of compound synergy, this paper is a nice compilation of methods. But there are some issues that weaken the paper. • One key issue are the conclusions on model performance. The organizers defined a score, termed probabilistic c-score (PC score). If I understand correctly, a random ranking should give PC = 0.5. It turns out that the best performing method exhibited a PC score = 0.61 with a number of methods hovering around 0.5. Undoubtably, this is a tough problem, but when the authors states that “… this challenge shows that current methodologies can perform significantly better than chance …” I raise an eyebrow. I can only assume that what they meant was that the results were “statistically significantly better than chance“, because in terms of effect size the results are not impressive. After reading this excellent article on p-values and significance testing I’m particularly sensitized to claims of significance. • The dataset could have been strengthened by the inclusion of self-crosses. This would’ve allowed the authors to assess actual excess over Bliss values corresponding to additivity (which will not be exactly 0 due to experimental noise), and avoid the use of cutoffs in determining what is synergistic or antagonistic. • Similarly, a key piece of data that would really strengthen these approaches is the expression data in presence of combinations. While it’s unreasonable to have this data available for all combinations, it could be used as a first step in developing models to predict the expression profile in presence of combination treatment. Certainly, such data could be used to validate some assumptions made by some of the models described (e.g., concordance of DEG’s induced by single agents implies synergistic response). • Kudos for including source code for the top methods, but would’ve been nicer if data files were included so we could actually reproduce the results. • The authors conclude that when designing new synergy experiments, one should identify mechanistically diverse molecules to make up for the “small number of potentially synergistic pathways“. While mechanistic diversity is a good idea, it’s not clear how they conclude there are a small number of pathways that play a role in synergy. • It’s a pity that the SynGen method was not compared to the other methods. While the authors provide a justification, it seems rather weak. The method only applied to the synergistic combinations (performance was not a whole lot better than random – true positive rate of 56%) – but the text indicates that it predicted synergistic compound pairs. It’s not clear whether this means it made a call on synergy or a predicted ranking. If the latter it would’ve been interesting to see how it compared to the rankings of the synergistic subset of 91 compounds from other methods. Written by Rajarshi Guha November 20th, 2014 at 5:37 pm ## Metabolite Similarity & Dirty Compounds with one comment Edit 10/9/14 – Updated statistics for the 1024 bit fingerprints There’s been some discussion about a paper by O’Hagan et al that have proposed a Rule of 0.5 that states that 90% of approved drugs exhibit a Tanimoto similarity > 0.5 to one or more human metabolites. Their analysis is based on metabolites listed in Recon2, a reconstruction of the human metabolic network. The idea makes sense and there’s an in depth discussion at In the Pipeline. Given the authors’ claim that a successful drug is likely to lie within a Tanimoto distance of 0.5 of a known human metabolite. While this does not mean, of course, that a molecule obeying the rule is likely to become a marketed drug for humans, it does mean that a molecule that fails to obey the rule is statistically most unlikely to do so I was interested in seeing how this rule of thumb holds up when faced with compounds that are not supposed to make it through the drug development pipeline. Since PAINS appear to be the structural filter du jour, I decided to look at compounds that failed the PAINS filter. I worked with the 10,000 compounds included in Saubern et al. Simon Saubern provided me the set of 861 compounds that failed the PAINS filters, allowing me to extract the set of compounds that passed (9139) Chris Swain was kind enough to extract the compound entries from the Matlab dump provided by O’Hagan et al. This file contained InChI representations for a subset of the entries. I extracted the 2980 valid InChI strings and converted them to SMILES using ChemAxon molconvert 6.0.5. The processed data (metabolite name, InChI and SMILES) are available here. However, after deduplication, there were 1335 unique metabolites Now, O’Hagan et al for some reason, used the 166 bit MACCS keys, but hashed them to 1024 bits. Usually, when using a keyed fingerprint, the goal is to retain the correspondence between bit position and substructure. The hashing step results in a loss of such correspondence. So it’s a bit surprising that they didn’t use some sort of path (Daylight) or environment (ECFPn) based fingerprint. Since I didn’t know how they hashed the MACCS keys, I calculated 166 bit MACCS keys and 1024 bt ECFP6 and extended path fingerprints using the CDK (via rcdk). Then for each compound in the PAINS pass or fail set, I computed the similarity to each of the 1335 metabolites and identified the maximum similarity (termed NMTS in the paper) and then plotted the distribution of these NMTS values between the PAINS pass and fail sets. First, the similarity cutoff proposed by the authors is obiously dependent on the fingerprint. So while the bulk of the 166 bit MACCS similarities are > 0.5, this is not really meaningful. A more relevant comparison is to 1024 bit fingerprints – both are hashed, so should be somewhat comparable to the authors choice of hashed MACCS keys. The path fingerprints lead to an NMTS of ~ 0.25 for both PAINS pass and fail sets and the ECFP6 leads to an NMTS of ~ 0.18 for both sets. Though the difference in medians between the pass and fail sets for the path fingerprint is statistically significant (p = 1.498e-05, Wilcoxon test), the difference itself is very small: 0.005. (For the circular fingerprint there is no statistically significant difference). However, the PAINS pass set does contain more outliers with values > 0.5. In that sense the proposed rule does separate the two groups. Of the top of my head I don’t know whether the WEHI screening deck that was the source of the 10,000 compounds was designed to be drug-like. At the same time all this might be saying is there is no relationship between metabolite-likenes and PAINS-likeness. It’d be interesting to see how this type of analysis holds up with other well known filter rules (REOS, Lilly etc). A related thing to look at would be to see how druglikeness scores compare with NMTS values. Code and data are available in this repository Written by Rajarshi Guha October 7th, 2014 at 5:47 pm ## rinchi – An R package to generate InChI’s and InChI Keys with 4 comments While trying to update rcdk on CRAN it was pointed out to me that usage of the library resulted in modifications to the users home directory. Specifically, this occurred when generating InChI‘s. The CDK makes use of jni-inchi, which in turn depends on JNATI which enables Java code to work with native libraries in a platform independent fashion. As part of this, it creates $HOME/.jnati – which is a no-no for CRAN packages. To resolve this, the latest version of rcdklibs excludes the InChI module and its dependencies. Hopefully rcdk and rcdklibs will now pass CRAN QC.

To access InChI functionality in R you can use the rinchi package which is hosted on Github. Since it will modify the users home directory, it cannot be hosted on CRAN. However, it’s easy enough to install

 12 library(devtools) install_github("cdkr", "rajarshi", subdir="rinchi")

Importantly, if all you need is to go from SMILES to InChI, there is no need to install rcdk as well. So the following works

 12 inchi <- get.inchi('CCC') inchik <- get.inchi.key('CCC')

But if you do have a molecule object obtained via rcdk, you can also pass that in to get an InChI or InChI key representation.

Written by Rajarshi Guha

August 30th, 2014 at 6:23 pm

Posted in software,cheminformatics

Tagged with , , , , ,