So much to do, so little time

Trying to squeeze sense out of chemical data

Search Result for rest — 112 articles

Ranking Dose Response Curves

with 3 comments

UPDATE (3/21) – I was contacted by the author of the paper who pointed out that my analysis was based on a misunderstanding of the paper. Specifically

  1. The primary goal of WES is to identify actives – and according to the authors definition, the most interesting actives (that should be ranked highly) are those that have no dose response and show a constant activity equal to the positive control. Next in importance are compounds that exhibit a dose response. Finally the least interesting (and so lowest ranked) are those that show no dose response and are flat at the negative control level.
  2. The WES method requires that data be normalized such that DMSO (i.e., negative control) is at 0 and positive control is at 100%.

Since my analysis was based on the wrong normalization scheme the conclusions were erroneous. When the proper normalization is taken into account, the method works as advertised in that it correctly ranks compounds that show constant activity at the positive control level at the top, followed by curves with a dose response and finally with inactives (no activity at all) at the bottom.

Based on this I’ve updated the figures and text to correct my mistake. However, in my opinion, if the goal is to identify compounds that have a constant activity one does not need to go to entropy. In addition, for the case of compounds with a well defined dose response, the WES essentially ranks them by potency (assuming a valid curve fit). The updated text goes on to discuss these aspects.

UPDATE (2/25) – Regenerated the enrichment curves so that data was ranked in the correct order when LAC50 was being used.

I came across a paper that describes the use of weighted entropy to rank order dose response curves. As this data type is the bread and butter of my day job a simple ranking method is always of interest to me. While the method works as advertised, it appears to be a rather constrained method and doesn’t seem to do a whole let better than simpler, pre-existing approaches.

The paper correctly notes that there is no definitive protocol to rank compounds using their dose response curves. Such rankings are invariably problem dependent – in some cases, simple potency based ranking of good quality curves is sufficient. In other cases structural clustering combined with a measure of potency enrichment is more suitable. In addition, it is also true that all compounds in a screen do not necessarily fit well to a 4-parameter Hill model. This may simply be due to noise but could also be due to some process that is better fit by some other model (bell or U shaped curves). The point being that rankings based on a pre-defined model may not be useful or accurate.

The paper proposes the use of entropy as a way to rank dose response curves in a model-free manner. While a natural approach is to use Shannon entropy, the author suggests that the equal weighting implicit in the calculation is unsuitable. Instead, the use of weighted entropy (WES) is proposed as a more robust approach that takes into account unreliable data points. The author defines the weights based on the level of detection of the assay (though I’d argue that since the intended goal is to capture the reliability of individual response points, a more appropriate weight should be derived from some form of variance – either from replicate data or else pooled across the collection) . The author then suggests that curves should be ranked by the WES value, with higher values indicating a better rank.

For any proposed ranking scheme, one must first define what the goal is. When ranking dose response curves are we looking for compounds

  • that exhibit well defined dose response (top and bottom asymptotes, > 80% efficacy etc)?
  • good potency, even if the curve is not that well fit?
  • compounds with a specific chemotype?

According to the paper, a key goal is to be able to identify compounds that show a constant activity – and within such compounds the more interesting ones are those that have constant activity = 100%. While I disagree that these are the most interesting compounds, it is not clear why one would need an entropy based method to identify such constant-activity curves (either at 100% or 0%).

w-vs-lac50More generally, for well defined dose response curves, the WES, by definition, tracks potency. This can be seen in the figure alongside that plots the WES value vs the log AC50 for a set of 27 good quality curves taken from a screen of 1408 AR agonists. Granted, when no model can be fit, one does not have an AC50, whereas a WES can be evaluated. But in such a case it’s not clear why one would necessarily want to quantify presumably noisy data.

However, going along with the authors definition, the method does distinguishwes-active-vs-inactive-activation valid dose responses from inactives (though again, one does not require entropy to make such a distinction!) as shown in the adjoining figure. It is clear from the definition of WES that a curve that is flat at 100% will exhibit the maximum value of WES and so will always rank high.

One way to to test the performance of ranking methods this is to take a collection of curves, rank them by a measure and identify how many actives are identified in the top N% of the collection, for varying N. Ideally, a good ranking would identify nearly all the actives for a small N. If the ranking were random one would identify N% of the actives in the top N% of the collection. Here an active is defined in terms of curve class, a heuristic that we use to initially weed out poor quality curves and focus on good quality ones. I defined active as curve classes 1.1, 1.2, 2.2 and 2.1 (see here for a summary of curve classes).

As pointed out by the author during our conversation, this is not an entirely fair comparison – since my schemewes-enrichment does not consider a flat curve at 100% as active. Though it’s a valid point, the dataset I worked with did not have any such curves. More generally, such curves would be the exception in a qHTS screen (assuming the concentration ranges have been correctly chosen). From that point of view, one should be able to apply WES to generate a ranking for any qHTS screen otherwise one would have to inspect the curves first to ensure that it contains such “flat actives” and then apply WES. Which is not the right way to go about it.

As shown in the enrichment plot shown alongside (generated for the 1408 compound AR agonist dataset), WES works better than random (and much better than the standard Shannon entropy), but is still outperformed by the area under the dose response curve (AUC) and potency. I certainly don’t claim that AUC is a completely robust way to rank dose response curves (in fact for some cases such as invalid curve fits, it’d be nonsensical). I also include LAC50, the logarithm of the AC50, as a ranking method simply because the paper considers it a poor way to rank curves (which I agree with, particularly if one does not first filter for good quality, efficacious curves).

There are a few other issues, though I think the most egregious one was that the method was tested on just one dataset. I’m not convinced that a single dataset represents a sufficient validation (given that Tox21 has about 80 published bioassays in PubChem). But that’s a case of poor reviewing rather than a technical flaw.

Written by Rajarshi Guha

July 23rd, 2014 at 1:52 pm

Which Datasets Lead to Predictive Models?

with 3 comments

I came across a recent paper from the Tropsha group that discusses the issue of modelability – that is, can a dataset (represented as a set of computed descriptors and an experimental endpoint) be reliably modeled. Obviously the definition of reliable is key here and the authors focus on a cross-validated classification accuracy as the measure of reliability. Furthermore they focus on binary classification. This leads to a simple definition of modelability – for each data point, identify whether it’s nearest neighbor is in the same class as the data point. Then, the ratio of number of observations whose nearest neighbor is in the same activity class to the number observations in that activity class, summed over all classes gives the MODI score. Essentially this is a statement on linear separability within a given representation.

The authors then go show a pretty good correlation between the MODI scores over a number of datasets and their classification accuracy. But this leads to the question – if one has a dataset and associated modeling tools, why compute the MODI? The authors state

we suggest that MODI is a simple characteristic that can be easily computed for any dataset at the onset of any QSAR investigation

I’m not being rigorous here, but I suspect for smaller datasets the time requirements for MODI calculations is pretty similar to building the models themselves and for very large datasets MODI calculations may take longer (due to the requirement of a distance matrix calculation – though this could be alleviated using ANN or LSH). In other words – just build the model!

Another issue is the relation between MODI and SVM classification accuracy. The key feature of SVMs is that they apply the kernel trick to transform the input dataset into a higher dimensional space that (hopefully) allows for better separability. As a result MODI calculated on the input dataset should not necessarily be related to the transformed dataset that is actually operated on by the SVM. In other words a dataset with poor MODI could be well modeled by an SVM using an appropriate kernel.

The paper, by definition, doesn’t say anything about what model would be best for a given dataset. Furthermore, it’s important to realize that every dataset can be perfectly predicted using a sufficiently complex model. This is also known as an overfit model. The MODI approach to modelability avoids this by considering a cross-validated accuracy measure.

One application of MODI that does come to mind is for feature selection - identify a descriptor subset that leads to a predictive model. This is justified by the observed correlation between the MODI scores and the observed classification rates and would avoid having to test feature subsets with the modeling algorithm itself. An alternative application (as pointed out by the authors) is to identify subsets of the data that exhibit a good MODI score, thus leading to a local QSAR model.

More generally, it would be interesting to extend the concept to regression models. Intuitively, a dataset that is continuous in a given representation should have a better modelability than one that is discontinuous. This is exactly the scenario that can be captured using the activity landscape approach. Sometime back I looked at characterizing the roughness of an activity landscape using SALI and applied it to the feature selection problem – being able to correlate such a measure to predictive accuracy of models built on those datasets could allow one to address modelability (and more specifically, what level of continuity should a landscape present to be modelable) in general.

Written by Rajarshi Guha

December 4th, 2013 at 4:21 pm

Posted in cheminformatics

Tagged with , , ,

Exploring co-morbidities in medical case studies

with 2 comments

A previous post described a first look at the data available in, primarily looking at summaries of high level meta-data. In this post I start looking at the cases themselves. As I noted previously, BMC has performed some form of biomedical entity recognition on the abstracts (?) of the case studies, resulting in a set of keywords for each case study. The keywords belong to specific types such as Condition, Medication and so on. The focus of this post will be to explore the occurrence of co-morbidities – which conditions occur together, to what extent and whether such occurrences are different from random. The code to extract the co-morbidity data and generate the analyses below is available in

Before doing any analyses we need to do some clean up of the Condition keywords. This includes normalizing terms (replacing ‘comatose’ with ‘coma’, converting all diabetes variants such as Type 1 and Type 2 to just diabetes), fixing spelling variants (replacing ‘foetal’ with ‘fetal’), removing stopwords and so on. The Python code to perform this clean up requires that we manually identify these transformations. I haven’t done this rigorously, so it’s not a totally cleansed dataset. The cleanup code looks like

def cleanTerms(terms):
    repMap = {'comatose':'coma',
              'lymphomas': 'lymphoma',
    stopwords = ['state','syndrome'', low grade', 'fever', 'type ii', 'mellitus', 'type 2', 'type 1', 'systemic', 'homogeneous', 'disease']
    l = []
    term = [x.lower().strip() for x in terms]
    for term in terms:
        for sw in stopwords: term = term.replace(sw, '')
        for key in repMap.keys():
            if term.find(key) >= 0: term = repMap[key]
        term = term.encode("ascii", "ignore").replace('\n','').strip()
    l = filter(lambda x: x != '-', l)

Since each case study can be associated with multiple conditions, we generate a set of unique condition pairs for each case, and collect these for all 28K cases I downloaded previously.

cases = pickle.load(open('cases.pickle'))
allpairs = []
for case in cases:
    ## get all conditions for this case
    conds = filter(lambda x: x['type'] == 'Condition', [x for x in case['keywords']])
    conds = cleanTerms([x['text'] for x in conds])
    if len(conds) == 0: continue
    pairs = [ (x,y) for x,y in list(combinations(conds, 2))]

It turns out that across the whole dataset, there are a total of 991,466 pairs of conditions corresponding to 576,838 unique condition pairs and 25,590 unique conditions. Now, it’s clear that some condition pairs may be causally related (some of which are trivial cases such as cough and infection), whereas others are not. In addition, it is clear that some condition pairs are related in a semantic, rather than causal, fashion – carcinoma and cancer. In the current dataset we can’t differentiate between these classes. One possibility would be to code the conditions using ICD10 and collapse terms using the hierarchy.

Number of co-morbidities vs frequency of occurrence

Number of co-morbidities vs frequency of occurrence

Having said that, we work with what we currently have – and it’s quite sparse. In fact the 28K case studies represent just 0.16% of all possible co-morbidities. Within the set of just under 600K unique observed co-morbidities, the bulk occur just once. For the rest of the analysis we ignore these singleton co-morbidities (leaving us with 513,997  co-morbidities). It’s interesting to see the distribution of frequencies of co-morbidities. The first figure plots the number of co-morbidities that occur at least N times – 99,369 co-morbidities occur 2 or more times in the dataset and so on.

Another way to visualize the data is to plot a pairwise heatmap of conditions. For pairs of conditions that occur in the cases dataset we can calculate the probability of occurrence (i.e., number of times the pair occurs divided by the number of pairs). Furthermore, using a sampling procedure we can evaluate the number of times a given pair would be selected randomly from the pool of conditions. For the current analysis, I used 1e7 samples and evaluated the probability of a co-morbidity occurring by chance. If this probability is greater than the observed probability I label that co-morbidity as not different from random (i.e., insignificant). Ideally, I would evaluate a confidence interval or else evaluate the probability analytically (?).

For the figure below, I considered the 48 co-morbidities (corresponding to 25 unique conditions) that occurred 250 or more times in the dataset. I display the lower triangle for the heatmap – grey indicates no occurrences for a given co-morbidity and white X’s identify co-morbidities that have a non-zero probability of occurrence but are not different from random. As noted above, some of these pairs are not particularly informative – for example, tumor and metastasis occur with a relatively high probability, but this is not too surprising

Probability of occurrence for co-morbidities occurring more than 250 times

Probability of occurrence for co-morbidities occurring more than 250 times

It’s pretty easy to modify to look at other sets of co-morbidities. Ideally, however, we would precompute probabilities for all co-morbidities and then support interactive visualization (maybe using D3).

It’s also interesting to look at co-morbidities that include a specific condition. For example, lets consider tuberculosis (and all variants). There are 948 unique co-morbidities that include tuberculosis as one of the conditions. While the bulk of them occur just twice, there are a number with relatively large frequencies of occurrence – lymphadenopathy co-occurs with tuberculosis 203 times. Rather than tabulate the co-occurring conditions, we can use the frequencies to generate a word cloud, as shown below. As with the co-morbidity heatmaps, this could be easily automated to support interactive exploration. On a related note, it’d be quite interesting to compare the frequencies discussed here with data extracted from a live EHR system

A visualization of conditions most frequently co-occurring with tuberculosis

A visualization of conditions most frequently co-occurring with tuberculosis

So far this has been descriptive – given the size of the data, we should be able to try out some predictive models. Future posts will look at the possibilities of modeling the case studies dataset.

Written by Rajarshi Guha

October 12th, 2013 at 10:43 pm

Exploring medical case studies

with one comment

I recently came across from BMC, a collection of more than 29,000 peer-reviewed case studies collected from a variety of journals. I’ve been increasingly interested in the possibilities of mining clinical data (inspired by impressive work from Atul Butte, Nigam Shah and others), so this seemed like a great resource to explore

The folks at BMC have provided a REST API, which is still in development – as a result, there’s no public documentation and it still has a few rough edges. However, thanks to help from Demitrakis Kavallierou, I was able to interact with the API and extract summary search information as well as 28,998 case studies as of Sept 23, 2013. I’ve made the code to extract case studies available as Running this, gives you two sets of data.

  1. A JSON file for each year between 2000 and 2014, containing the summary results for all cases in that year which includes a summary view of the case, plus facets for a variety of fields (age, condition, pathogen, medication, intervention etc.)
  2. A pickle file containing the case reports, as a list of maps. The case report contains the full abstract, case report identifier and publication meta-data.

A key feature of the case report entries is that BMC has performed some form of entity recognition so that it provides a list of keywords identified by different types: ‘Condition’, ‘Symptom’, ‘Medication’ etc. Each case may have multiple occurences for each type of keyword and importantly, each keyword is associated with the text fragment it is extracted from. As an example consider case 10.1136/bcr.02.2009.1548. The entry extracts two conditions

{u'sentence': u'She was treated by her family physician for presumptive interscapular myositis with anti-inflammatory drugs, cold packs and rest.',
u'text': u'Myositis',
u'type': u'Condition'}


{u'sentence': u'The patient denied any constitutional symptoms and had no cough.',
u'text': u'Cough',
u'type': u'Condition'}

I’m no expert in biomedical entity recognition, but the fact that BMC has performed it, saves me from having to become one, allowing me to dig into the data. But there are the usual caveats associated with text mining – spelling variants, term variants (insulin and insulin therapy are probably equivalent) and so on.

Count of cases deposited per year

Count of cases deposited per year

However, before digging into the cases themselves, we can use the summary data, and especially the facet information (which is, by definition, standardized) to get some quick summaries from the database. For example we see the a steady increase of case studies deposited in the literature over the last decade or so.

Interestingly, the number of unique conditions, medications or pathogens reported for these case studies is more or less constant, though there seems to be a downward trend for conditions. The second graph highlights this trend, by plotting the number of unique facet terms (for three types of facets) per year, normalized by the number of cases deposited that year.

Normalized count of unique facet terms by year

Normalized count of unique facet terms by year

This is a rough count, since I didn’t do any clean up of the text – so that misspellings of the same term (say, acetaminophen and acetaminaphen will be counted as two separate medication facets) may occur.

Another interesting task would be to enrich the dataset with additional annotations - ICD9/ICD10 for conditions, ATC for drugs – which would allow a higher level categorization and linking of case studies. In addition, one could use the CSLS service to convert medication names to chemical structures and employ structural similarity to group case studies.

The database also records some geographical information for each case. Specifically, it lists the countries that the authors are from. While interesting to an extent, it would have been nice if the country of occurrence or country of treatment were specifically extracted from the text. Currently, one might infer that the treatment occurred in the same country as the author is from, but this is likely only true when all authors are from the same country. Certainly, multinational collaborations will hide the true number of cases occurring in a given country (especially so for tropical diseases).

But we can take a look at how the number of cases reported for specific conditions, varies with geography and time. The figure below shows the cases whose conditions included the term tuberculosis

Tuberculosis cases by country and year

Tuberculosis cases by country and year

The code to extract the data from the pickle file is in Assuming you have cases.pickle in your current path, usage is

$ python condition_name

and will output the data into a CSV file, which you can the process using your favorite tools.

In following blog posts, I’ll start looking at the actual case studies themselves. Interesting things to look at include exploring the propensity of co-morbidities, analysing the co-occurrence of conditions and medications or conditions and pathogens, to see whether the set of treatments associated with a given condition (or pathogen) has changed over time. Both these naturally lead to looking at the data with eye towards repurposing events.

Written by Rajarshi Guha

October 10th, 2013 at 7:20 pm

Life and death in a screening campaign

without comments

So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing the The R Book  and came across survival models. Such models are used to characterize time to events, where an event could be death of a patient or failure of a part and so on. In these types of models the dependent variable is the number of time units that pass till the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.

It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons for why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How does molecular weight, lipophilicity etc. affect a compounds “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays can be more or less indicative of a compounds survival in a campaign. Survival models allow us to address these questions.

How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.

Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve characterizes the proportion of individuals alive at a certain time point. There are three types of survivorship curves: I, II and III corresponding to scenarios where individuals have a higher risk of death at later times, a constant risk of death and individuals have a higher risk of death at earlier times, respectively.

For the case of the a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general, they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There a variety of functions to choose from, but either require that you know the error distribution or else are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model. I won’t go into the details of either approach, save to note that using a Cox model does not allow us to make predictions beyond the last time point whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last timepoint (i.e., the last assay) but are more interested in knowing what factors might affect survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.

OK – it sounds cute, but has some obvious limitations

  1. The use of a survival model assumes a linear time line. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
  2. A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules are selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others or some notion of interesting combined with a hunch that it will work out. Essentially chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that it is possible the model captures the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
  3. Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, decrease in death rate is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2000 will be cherry picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.

The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.

Written by Rajarshi Guha

October 2nd, 2013 at 10:22 pm