So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘research’ Category

A Model Building IDE?

with 3 comments

Recently I came across a NIPS2015 paper from Vartak et al that describes a system (APIs + visual frontend) to support the iterative model building process. The problem they are addressing is common one in most machine learning settings – building multiple models (different type) using various features and identifying one or more optimal models to take into production. As they correctly point out, most tools such as scikit-learn, SparkML, etc. focus on providing methods and interfaces to build a single model – it’s up to the user to manage the multiple models, keep track of their performance metrics.

sherlock

My first reaction was, “why?”. Part of this stems from my use of the R environment which allows me to build up a infrastructure for building multiple models (e.g., caret, e1701), storing them (list of model objects, RData binary files or even pmml) and subsequent comparison and summarization. Naturally, this is specific to me (so not really a general solution) and essentially a series of R commands – I don’t have the ability to monitor model building progress and simultaneously inspect models that have been built.

In that sense, the authors statement,

For the most part, exploration of models takes place sequentially. Data scientists must write (repeated) imperative code to explore models one after another.

is correct (though maybe, said data scientists should improve their programming skills?). So a system that allows one to organize an exploration of model space could be useful. However, other statements such as

  • Without a history of previously trained models, each model must be trained from scratch, wasting computation and increasing response times.
  • In the absence of history, comparison between models becomes challenging and requires the data scientist to write special-purpose code.

seem to be relevant only when you are not already working in an environment that supports model objects and their associated infrastructure. In my workflow, the list of model object represents my history!

Their proposed system called SHERLOCK is built on top of SparkML and provides an API to model building, a database for model storage and a graphical interface (see above) to all of this. While I’m not convinced of some of the design goals (e.g., training variations of models based on previously trained models (e.g., with different feature sets) – wouldn’t you need to retrain the model from scratch if you choose a new feature set?), it’ll be interesting to see how this develops. Certainly, the UI will be key – since it’s pretty easy to generate a multitude of models with a multitude of features, but in the end a human needs to make sense of it.

On a separate note, this sounds like something RStudio could take a stab at.

Written by Rajarshi Guha

December 7th, 2015 at 3:09 am

Surveying the Opinion of Chemists

without comments

As part of a project I was wondering about reports of surveys that collected chemists assessments of differnt things. More specifically, I wasn’t looking for crowd-sourcing efforts for data curation (such as the in the Spectral Game) or data collection. Rather, I was interested in reports where somebody asked a group of chemists what they thought of some particular molecular “feature”. Here, “feature” is pretty broadly defined and could range from the quality of a probe molecule to whether a molecule is complex or not.

Surveying the literature (and with pointers from @dgelemi, @baoilleach, Jun Li, @georgeisyourman and @DrBostrom) here’s the following papers:

The number of people surveyed across these studies ranges from less than 10 to more than 300. Recently there appears to be a trend towards developing predictive models based on the results of such surveys. Also, molecular complexity seems pretty popular. Modeling opinion is always a tricky thing, though in my mind some aspects (e.g., complexity, diversity) lend themselves to more robust models than others (e.g., quality of a probe).

If there are other examples of such surveys in chemistry, I’d appreciate any pointers

Written by Rajarshi Guha

November 7th, 2015 at 1:21 pm

Post-doc (Molecular Informatics) Opening at NCATS

without comments

I have a post-doc opening in the Informatics group at NCATS, to work on computational aspects of high throughput combination screening – topics will include predicting drug combination response, visualizing large combination screens (> 5000 combinations) and so on. The NCATS combination screening platform thas tested more than 65,000 compound combinations (in checkerboard style which means more than 4.5M individual dose combinations) along with single agent dose responses. You can view publicly released data at https://tripod.nih.gov/matrix-client.

The NCATS Informatics group is a collection of very smart people, with wide ranging interests in molecular informatics. We work closely with colleagues in biology and chemistry. As a result, we eat a lot of our own dog food. In addition, we’re committed to implementing our ideas in publicly available software tools as well as publishing in journals.

Lots of data, great people and tough problems. If this piques your interest visit the job posting for more details

Written by Rajarshi Guha

February 6th, 2015 at 8:52 pm

Applications Invited for CSA Trust Grant for 2015

without comments

The Chemical Structure Association (CSA) Trust is an internationally recognized organization established to promote the critical importance of chemical information to advances in chemical research. In support of its charter, the Trust has created a unique Grant Program and is now inviting the submission of grant applications for 2015.

Purpose of the Grants
The Grant Program has been created to provide funding for the career development of young researchers who have demonstrated excellence in their education, research or development activities that are related to the systems and methods used to store, process and retrieve information about chemical structures, reactions and compounds. One or more Grants will be awarded annually up to a total combined maximum of ten thousand U.S. dollars ($10,000). Grants are awarded for specific purposes, and within one year each grantee is required to submit a brief written report detailing how the grant funds were allocated. Grantees are also requested to recognize the support of the Trust in any paper or presentation that is given as a result of that support.

Who is Eligible?
Applicant(s), age 35 or younger, who have demonstrated excellence in their chemical information related research and who are developing careers that have the potential to have a positive impact on the utility of chemical information relevant to chemical structures, reactions and compounds, are invited to submit applications. While the primary focus of the Grant Program is the career development of young researchers, additional bursaries may be made available at the discretion of the Trust. All requests must follow the application procedures noted below and will be weighed against the same criteria.

Which Activities are Eligible?
Grants may be awarded to acquire the experience and education necessary to support research activities; e.g. for travel to collaborate with research groups, to attend a conference relevant to one’s area of research, to gain access to special computational facilities, or to acquire unique research techniques in support of one’s research.

Application Requirements:
Applications must include the following documentation:

  1. A letter that details the work upon which the Grant application is to be evaluated as well as details on research recently completed by the applicant;
  2. The amount of Grant funds being requested and the details regarding the purpose for which the Grant will be used (e.g. cost of equipment, travel expenses if the request is for financial support of meeting attendance, etc.). The relevance of the above-stated purpose to the Trust’s objectives and the clarity of this statement are essential in the evaluation of the application);
  3. A brief biographical sketch, including a statement of academic qualifications;
  4. Two reference letters in support of the application. Additional materials may be supplied at the discretion of the applicant only if relevant to the application and if such materials provide information not already included in items 1-4. Three copies of the complete application document must be supplied for distribution to the Grants Committee.

Deadline for Applications
Applications for the 2015 Grant is March 13, 2015. Successful applicants will be notified no later than May 2nd of the relevant year.

Address for Submission of Applications
The application documentation should be forwarded to: Bonnie Lawlor, CSA Trust Grant Committee Chair, 276 Upper Gulph Road, Radnor, PA 19087, USA. If you wish to enter your application by e-mail, please contact Bonnie Lawlor at chescot@aol.com prior to submission so that she can contact you if the e-mail does not arrive.

Written by Rajarshi Guha

February 2nd, 2015 at 5:20 pm

Posted in cheminformatics

Tagged with , ,

Ridit Analysis

without comments

While preparing material for an R workshop I ran at NCATS I was pointed to the topic of ridit analysis. Since I hadn’t heard of this technique before I decided to look into it and investigate how R could be used for such an analysis (and yes, there is a package for it).

Why ridit analysis?

First, lets consider why one might consider ridit analysis. In many scenarios one might have data that is categorical, but the categories are ordered. This type of data is termed ordinal (sometimes also called “nominal with order”). An example might be a trial of an analgesics ability to reduce pain, whose outcome could be no pain, some pain, extreme pain. While there are three categories, it’s clear that there is an ordering to them. Analysis of such data usually makes use of methods devised for categorical data – but such methods will not make use of the information contained within the ordering of the categories. Alternatively, one might numerically code the groups using 1, 2 and 3 and then apply methods devised for continuous or discrete variables. This is not appropriate since one can change the results by simply changing the category coding.

Ridit analysis essentially transforms ordinal data to a probability scale (one could call it a virtual continuous scale). The term actually stands for relative to an identified distribution integral transformation and is analoguous to probit or logit. (Importantly, ridit analysis is closely related to the Wilcoxon rank sum test. As shown by Selvin, the Wilcoxon test statistic and the mean ridit are directly related).

Definitions

Essentially, one must have at least two groups, one of which is selected as the reference group. Then for the non-reference group, the mean ridit is an

estimate of the probability that a random individual from that group will have a value on the underlying (virtual) continuous scale greater than or equal to the value for a random individual from the reference group.

So if larger values of the underlying scale imply a worse condition, then the mean ridit is the probability estimate that the random individual from the group is worse of than a random individual from the reference group (based on the interpretation from Bross). Based on the definition of a ridit (see here or here), one can compute confidence intervals (CI) or test the hypothesis that different groups have equal mean ridits. Lets see how we can do that using R

Mechanics of ridit analysis

Consider a dataset taken from Donaldson, (Eur. J. Pain, 1988) which looked at the effect of high and low levels of radiation treatment on trials participants’s sleep. The numbers are counts of patients:

1
2
3
4
5
6
7
8
9
10
11
12
sleep <- data.frame(pain.level=factor(c('Slept all night with no pain',
                      'Slept all night with some pain',
                      'Woke with pain - medication provided relief',
                      'Woke with pain - medication provided no relief',
                      'Awake most or all of night with pain'),
                      levels=c('Slept all night with no pain',
                        'Slept all night with some pain',
                        'Woke with pain - medication provided relief',
                        'Woke with pain - medication provided no relief',
                        'Awake most or all of night with pain')),
                    low.dose=c(3, 10, 6,  2, 1),
                    high.dose=c(6,10,2,0,0))

Here the groups are in the columns (low.dose and high.dose) and the categories are ordered such tat Awake most or all of night with pain is the “maximum” category. To compute the mean ridits for each dose group we first reorder the table and then convert the counts to proportions and then compute ridits for each category (i.e., row).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## reorder table
sleep <- sleep[ length(levels(sleep$pain.level)):1, ]
## compute proportions
sleep$low.dose.prop <- sleep$low.dose / sum(sleep$low.dose)
sleep$high.dose.prop <- sleep$high.dose / sum(sleep$high.dose)
## compute riddit
ridit <- function(props) { ## props should be in order of levels (highest to lowest)
  r <- rep(-1, length(props))
  for (i in 1:length(props)) {
    if (i == length(props)) vals <- 0
    else vals <- props[(i+1):length(props)]
    r[i] <- sum(vals) + 0.5*props[i]
  }
  return(r)
}
sleep$low.dose.ridit <- ridit(sleep$low.dose.prop)
sleep$high.dose.ridit <- ridit(sleep$high.dose.prop)

The resultant table is below

1
2
3
4
5
6
                                      pain.level low.dose high.dose low.dose.prop high.dose.prop low.dose.ridit high.dose.ridit
5           Awake most or all of night with pain        1         0    0.04545455      0.0000000     0.97727273       1.0000000
4 Woke with pain - medication provided no relief        2         0    0.09090909      0.0000000     0.90909091       1.0000000
3    Woke with pain - medication provided relief        6         2    0.27272727      0.1111111     0.72727273       0.9444444
2                 Slept all night with some pain       10        10    0.45454545      0.5555556     0.36363636       0.6111111
1                   Slept all night with no pain        3         6    0.13636364      0.3333333     0.06818182       0.1666667

The last two columns represent the ridit values for each category and can be interpreted as

a probability estimate that an individuals value on the underlying continuous scale is less than or equal to the midpoint of the corresponding interval

The next step (and main point of the analysis) is to compute the mean ridit for a group (essentially the sum of the category proportions for that group weighted by the category ridits in the reference group) , based on a reference. In this case, lets assume the low dose group is the reference.

1
mean.r.high <- sum(with(sleep, high.dose.prop * low.dose.ridit))

which is 0.305, and can be interpreted as the probability that a patient receiving the high dose of radiation will experience more sleep interference than a patient in the low dose group. Importantly, since ridits are estimates of probabilities, the complementary ridit (i.e., using the high dose group as reference) comes out to 0.694 and is the probability that a patient in the low radiation dose group will experience more sleep interferance than a patient in the high dose group.

Statistics on ridits

There are a number of ways to compute CI’s on mean ridits or else test the hypothesis that the mean ridits differ between \(k\) groups. Donaldsons method for CI calculation appears to be restricted to two groups. In contrast, Fleiss et al suggest an alternate method based on. Considering the latter, the CI for a group vs the reference group is given by

\(\overline{r}_i \pm B \frac{\sqrt{n_s +n_i}}{2\sqrt{3 n_s n_i}}\)

where \(\overline{r}_i\) is the mean ridit for the \(i\)’th group, \(n_s\) and \(n_i\) are the sizes of the reference and query groups, respectively and \(B\) is the multiple testing corrected standard error. If one uses the Bonferroni correction, it would be \(1.96 \times 1\) since there is only two groups being compared (and so 1 comparison). Thus the CI for the mean ridit for the low dose group, using the high dose as reference is given by

\(0.694 \pm 1.96 \frac{\sqrt{18 + 22}}{2\sqrt{3 \times 18 \times 22}}\)

which is 0.515 to 0.873. Given that the interval does not include 0.5, we can conclude that there is a statistically significant difference (\(\alpha = 0.05\)) in the mean ridits between the two groups. For the case of multiple groups, the CI for any group vs any other group (i.e., not considering the reference group) is given by

\( (\overline{r}_i – \overline{r}_j + 0.5) \pm B \frac{\sqrt{n_i + n_j}}{2\sqrt{3 n_i n_j}}\)

Fleiss et al also describes how one can test the hypothesis that the mean ridits across all groups (including the reference) are equal using a \(\chi^2\) statistic. In addition, they also describe how one can perform the same test between any group and the reference group.

R Implementation

I’ve implemented a function that computes mean ridits and their 95% confidence interval (which can be changed). It expects that the data is provided as counts for each category and that the input data.frame is ordered in descending order of the categories. You need to specify the variable representing the categories and the reference variable. As an example of its usage, we use the dataset from Fleiss et al which measured the degree of pain relief provided by different drugs after oral surgery. We perfrom a ridit analysis using aspirin as the reference group:

1
2
3
4
5
6
7
8
dental <- data.frame(pain.relief = factor(c('Very good', 'Good', 'Fair', 'Poor', 'None'),
                       levels=c('Very good', 'Good', 'Fair', 'Poor', 'None')),
                     ibuprofen.low = c(61, 17, 10, 6, 0),
                     ibuprofen.high = c(52, 25, 5, 3, 1),
                     Placebo = c(32, 37, 10, 18, 0),                    
                     Aspirin = c(47, 25, 11, 4, 1)
                     )
ridit(dental, 'pain.relief', 'Aspirin')

which gives us

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
$category.ridit
  pain.relief ibuprofen.low ibuprofen.high    Placebo
1   Very good    0.67553191    0.697674419 0.83505155
2        Good    0.26063830    0.250000000 0.47938144
3        Fair    0.11702128    0.075581395 0.23711340
4        Poor    0.03191489    0.029069767 0.09278351
5        None    0.00000000    0.005813953 0.00000000

$mean.ridit
 ibuprofen.low ibuprofen.high        Placebo
     0.5490812      0.5455206      0.3839620

$ci
           group       low      high
1  ibuprofen.low 0.4361128 0.6620496
2 ibuprofen.high 0.4300396 0.6610016
3        Placebo 0.2718415 0.4960826

The results suggest that patients receiving either dose of ibuprofen will get better pain relief compared to aspirin. However, if you consider the CI’s it’s clear that they both contain 0.5 and thus there is no statistical difference in the mean ridits for these two doses, compared to aspirin. On the other hand, placebo definitely leads to less pain relief compared to aspirin.

Written by Rajarshi Guha

January 18th, 2015 at 7:26 pm