So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘bioinformatics’ Category

Thoughts on the DREAM Synergy Prediction Challenge

The DREAM consortium has run a number of predictive modeling challenges and the latest one on predicting small molecule synergies has just been published. The dataset that was provided included baseline gene expression of the cell line (OCI-LY3), expression in presence of compound (2 concentrations, 2 time points), dose response data for 14 compounds and the excess over Bliss for the 91 pairs formed from the 14 compounds. Based on this data (and available literature data) participants had to predict a ranking for the 91 combinations.
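For readers unfamiliar with the terminology, here is a minimal sketch of what "excess over Bliss" means and why 14 compounds give 91 pairs. It assumes effects are expressed as fractional inhibition between 0 and 1; the compound names and the numbers are made up for illustration, and the challenge organizers' exact computation may differ.

```python
from itertools import combinations


def bliss_expected(effect_a: float, effect_b: float) -> float:
    """Expected fractional effect of a pair under Bliss independence."""
    return effect_a + effect_b - effect_a * effect_b


def excess_over_bliss(observed: float, effect_a: float, effect_b: float) -> float:
    """Observed minus expected effect; > 0 suggests synergy, < 0 antagonism."""
    return observed - bliss_expected(effect_a, effect_b)


# 14 compounds give 14 * 13 / 2 = 91 unordered pairs.
compounds = [f"cmpd_{i}" for i in range(1, 15)]  # placeholder names
pairs = list(combinations(compounds, 2))
print(len(pairs))  # 91

# Made-up example: A alone inhibits 30%, B alone 40%, the combination 70%.
print(round(excess_over_bliss(0.70, 0.30, 0.40), 2))  # 0.12
```

In this toy case the expected combined inhibition is 0.58, so an observed 0.70 is 0.12 above Bliss additivity.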

The paper reports the results of 31 approaches (plus one method that was not compared to the others) and does a good job of summarizing their performance and identifying whether certain data types or certain approaches work better than others. They also investigated the performance of an ensemble of approaches, which, as one might expect, worked better than the single methods. While the importance of gene expression in predictive performance was not as great as I would’ve thought, it was certainly more useful than chemical structure alone. Interestingly, they also noted that “compounds with more targeted mechanisms, such as rapamycin and blebbistatin, were least synergistic”. I suspect that this is somewhat dataset specific, but it will be interesting to see whether this holds in large collections of combination experiments such as those run at NCATS.

Overall, it’s an important contribution with the key take home message being

… synergy and antagonism are highly context specific and are thus not universal properties of the compounds’ chemical, structural or substrate information. As a result, predictive methods that account for the genetics and regulatory architecture of the context will become increasingly relevant to generalize results across multiple contexts

Given the relative dearth of predictive models of compound synergy, this paper is a nice compilation of methods. But there are some issues that weaken the paper.

  • One key issue concerns the conclusions on model performance. The organizers defined a score, termed the probabilistic c-score (PC score). If I understand correctly, a random ranking should give PC = 0.5. It turns out that the best performing method exhibited a PC score of 0.61, with a number of methods hovering around 0.5. Undoubtedly, this is a tough problem, but when the authors state that “… this challenge shows that current methodologies can perform significantly better than chance …” I raise an eyebrow. I can only assume that what they meant was that the results were “statistically significantly better than chance”, because in terms of effect size the results are not impressive. After reading this excellent article on p-values and significance testing I’m particularly sensitized to claims of significance.
  • The dataset could have been strengthened by the inclusion of self-crosses. This would’ve allowed the authors to assess actual excess over Bliss values corresponding to additivity (which will not be exactly 0 due to experimental noise), and avoid the use of cutoffs in determining what is synergistic or antagonistic.
  • Similarly, a key piece of data that would really strengthen these approaches is the expression data in presence of combinations. While it’s unreasonable to have this data available for all combinations, it could be used as a first step in developing models to predict the expression profile in presence of combination treatment. Certainly, such data could be used to validate some assumptions made by some of the models described (e.g., concordance of DEGs induced by single agents implies synergistic response).
  • Kudos for including source code for the top methods, but it would’ve been nicer if data files were included so we could actually reproduce the results.
  • The authors conclude that when designing new synergy experiments, one should identify mechanistically diverse molecules to make up for the “small number of potentially synergistic pathways”. While mechanistic diversity is a good idea, it’s not clear how they conclude there are a small number of pathways that play a role in synergy.
  • It’s a pity that the SynGen method was not compared to the other methods. While the authors provide a justification, it seems rather weak. The method was only applied to the synergistic combinations (performance was not a whole lot better than random – true positive rate of 56%) – but the text indicates that it predicted synergistic compound pairs. It’s not clear whether this means it made a call on synergy or predicted a ranking. If the latter, it would’ve been interesting to see how it compared to the rankings of the synergistic subset of the 91 combinations from other methods.
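To make the "random ranking gives 0.5" point from the first bullet concrete, here is a small sketch using a plain concordance index. This is not the organizers' PC score (whose exact definition is in the paper); it is the ordinary, non-probabilistic version, and the observed values are simulated rather than taken from the challenge data.

```python
import random
from itertools import combinations


def concordance_index(observed, predicted):
    """Fraction of comparable pairs that both lists order the same way.

    Pairs tied in `observed` are skipped; pairs tied in `predicted`
    count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(observed)), 2):
        if observed[i] == observed[j]:
            continue
        comparable += 1
        obs_diff = observed[i] - observed[j]
        pred_diff = predicted[i] - predicted[j]
        if pred_diff == 0:
            concordant += 0.5
        elif (obs_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


rng = random.Random(0)
observed = [rng.gauss(0, 1) for _ in range(91)]  # simulated, one value per pair

# A perfect prediction scores 1.0 and a fully reversed one scores 0.0.
print(concordance_index(observed, observed))
print(concordance_index(observed, [-v for v in observed]))

# Random predictions average out close to 0.5.
random_scores = [
    concordance_index(observed, [rng.random() for _ in observed])
    for _ in range(200)
]
mean_random = sum(random_scores) / len(random_scores)
print(mean_random)
```

Individual random rankings scatter around 0.5, which is why a best score of 0.61 needs to be read against that spread and not only against a p-value.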

Written by Rajarshi Guha

November 20th, 2014 at 5:37 pm

Exploring medical case studies

I recently came across a collection from BMC of more than 29,000 peer-reviewed case studies, gathered from a variety of journals. I’ve been increasingly interested in the possibilities of mining clinical data (inspired by impressive work from Atul Butte, Nigam Shah and others), so this seemed like a great resource to explore.

The folks at BMC have provided a REST API, which is still in development – as a result, there’s no public documentation and it still has a few rough edges. However, thanks to help from Demitrakis Kavallierou, I was able to interact with the API and extract summary search information as well as 28,998 case studies as of Sept 23, 2013. I’ve made the code to extract case studies available. Running it gives you two sets of data.

  1. A JSON file for each year between 2000 and 2014, containing the summary results for all cases in that year which includes a summary view of the case, plus facets for a variety of fields (age, condition, pathogen, medication, intervention etc.)
  2. A pickle file containing the case reports, as a list of maps. The case report contains the full abstract, case report identifier and publication meta-data.

A key feature of the case report entries is that BMC has performed some form of entity recognition, so that it provides a list of keywords identified by different types: ‘Condition’, ‘Symptom’, ‘Medication’ etc. Each case may have multiple occurrences for each type of keyword and, importantly, each keyword is associated with the text fragment it is extracted from. As an example, consider case 10.1136/bcr.02.2009.1548. The entry extracts two conditions:

{u'sentence': u'She was treated by her family physician for presumptive interscapular myositis with anti-inflammatory drugs, cold packs and rest.',
u'text': u'Myositis',
u'type': u'Condition'}


{u'sentence': u'The patient denied any constitutional symptoms and had no cough.',
u'text': u'Cough',
u'type': u'Condition'}

I’m no expert in biomedical entity recognition, but the fact that BMC has performed it saves me from having to become one, allowing me to dig into the data. But there are the usual caveats associated with text mining – spelling variants, term variants (insulin and insulin therapy are probably equivalent) and so on.
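As a sketch of how these keyword entries can be used, the snippet below tallies the terms of one type. The two entries are copied from the example above; how the keyword list is attached to a case record (the field name) is not shown in this post, so the snippet works directly on a list of entries.

```python
from collections import Counter

# The two keyword entries from the example case above.
keywords = [
    {
        "sentence": "She was treated by her family physician for presumptive "
                    "interscapular myositis with anti-inflammatory drugs, "
                    "cold packs and rest.",
        "text": "Myositis",
        "type": "Condition",
    },
    {
        "sentence": "The patient denied any constitutional symptoms and had no cough.",
        "text": "Cough",
        "type": "Condition",
    },
]


def terms_of_type(entries, keyword_type):
    """Return the lower-cased keyword text for every entry of the given type."""
    return [e["text"].strip().lower() for e in entries if e["type"] == keyword_type]


condition_counts = Counter(terms_of_type(keywords, "Condition"))
print(condition_counts)
```

Lower-casing only handles the most trivial variants; it does nothing for cases like insulin versus insulin therapy.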

Figure: Count of cases deposited per year

However, before digging into the cases themselves, we can use the summary data, and especially the facet information (which is, by definition, standardized) to get some quick summaries from the database. For example, we see a steady increase in case studies deposited in the literature over the last decade or so.

Interestingly, the number of unique conditions, medications or pathogens reported for these case studies is more or less constant, though there seems to be a downward trend for conditions. The second graph highlights this trend, by plotting the number of unique facet terms (for three types of facets) per year, normalized by the number of cases deposited that year.

Figure: Normalized count of unique facet terms by year

This is a rough count, since I didn’t do any clean up of the text, so misspellings of the same term may occur (say, acetaminophen and acetaminaphen will be counted as two separate medication facets).
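The normalization described above is simple enough to sketch. The input layout here, a list of (year, terms) tuples, is an assumption for illustration and not the structure of the JSON summary files; the toy data is made up and deliberately includes the misspelling mentioned above to show its effect on the count.

```python
from collections import defaultdict


def normalized_unique_terms(cases):
    """Map each year to (number of unique terms) / (number of cases).

    `cases` is an iterable of (year, list_of_terms) tuples. Terms are
    compared exactly, with no clean up, mirroring the rough count above.
    """
    terms_by_year = defaultdict(set)
    cases_by_year = defaultdict(int)
    for year, terms in cases:
        cases_by_year[year] += 1
        terms_by_year[year].update(terms)
    return {
        year: len(terms_by_year[year]) / cases_by_year[year]
        for year in sorted(cases_by_year)
    }


toy_cases = [
    (2012, ["acetaminophen"]),
    (2012, ["insulin"]),
    (2013, ["acetaminophen"]),
    (2013, ["acetaminaphen"]),  # misspelling, counted as a separate term
    (2013, ["acetaminophen"]),
    (2013, ["insulin"]),
]
normalized = normalized_unique_terms(toy_cases)
print(normalized)  # {2012: 1.0, 2013: 0.75}
```

Without the misspelling, 2013 would come out at 0.5 instead of 0.75, which gives a feel for how much uncleaned text can inflate the curve.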

Another interesting task would be to enrich the dataset with additional annotations - ICD9/ICD10 for conditions, ATC for drugs – which would allow a higher level categorization and linking of case studies. In addition, one could use the CSLS service to convert medication names to chemical structures and employ structural similarity to group case studies.

The database also records some geographical information for each case. Specifically, it lists the countries that the authors are from. While interesting to an extent, it would have been nice if the country of occurrence or country of treatment were specifically extracted from the text. Currently, one might infer that the treatment occurred in the same country as the author is from, but this is likely only true when all authors are from the same country. Certainly, multinational collaborations will hide the true number of cases occurring in a given country (especially so for tropical diseases).

But we can take a look at how the number of cases reported for specific conditions varies with geography and time. The figure below shows the cases whose conditions included the term tuberculosis.

Figure: Tuberculosis cases by country and year

The code to extract the data from the pickle file is also available. Assuming you have cases.pickle in your current path, usage is

$ python condition_name

and it will output the data into a CSV file, which you can then process using your favorite tools.
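For a rough idea of what such an extraction involves, here is a sketch that counts cases by country and year for a given condition and writes them out as CSV. It is not the actual script: the field names `conditions`, `countries` and `year` are hypothetical stand-ins for whatever the case records really contain, and the toy cases are made up. With the real data you would first load the list using `pickle.load` on cases.pickle.

```python
import csv
import io
from collections import Counter


def count_cases(cases, condition):
    """Count cases whose conditions mention `condition`, by (country, year)."""
    needle = condition.lower()
    counts = Counter()
    for case in cases:
        # Field names below are hypothetical, chosen for this sketch.
        if any(needle in c.lower() for c in case["conditions"]):
            for country in case["countries"]:
                counts[(country, case["year"])] += 1
    return counts


def write_counts(counts, handle):
    """Write the counts as CSV rows sorted by country, then year."""
    writer = csv.writer(handle)
    writer.writerow(["country", "year", "n_cases"])
    for (country, year), n in sorted(counts.items()):
        writer.writerow([country, year, n])


toy_cases = [
    {"conditions": ["Tuberculosis"], "countries": ["India"], "year": 2010},
    {"conditions": ["Pulmonary tuberculosis"], "countries": ["India", "UK"], "year": 2010},
    {"conditions": ["Myositis"], "countries": ["UK"], "year": 2011},
]
counts = count_cases(toy_cases, "tuberculosis")

buffer = io.StringIO()  # stands in for an output file
write_counts(counts, buffer)
print(buffer.getvalue())
```

Note that a case with authors from two countries is counted once for each, which is exactly the ambiguity about author country versus country of treatment discussed above.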

In subsequent blog posts, I’ll start looking at the actual case studies themselves. Interesting things to look at include exploring the propensity of co-morbidities, and analysing the co-occurrence of conditions and medications or conditions and pathogens, to see whether the set of treatments associated with a given condition (or pathogen) has changed over time. Both of these naturally lead to looking at the data with an eye towards repurposing events.

Written by Rajarshi Guha

October 10th, 2013 at 7:20 pm

Cheminformatics and Hotness (or lack thereof)

A few days back I discussed some thoughts on cheminformatics vis a vis bioinformatics, inspired by a review by Aaron Sterling. In that thread, Steven Salzberg made a comment, stating

… In my opinion, it is not a “hot” field, though, in part for some of the reasons mentioned in the post – particularly the fact that the data in the field is mostly proprietary and/or secret. So they hurt themselves by that behavior. But the other reason I don’t think it is moving that fast is that, unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological advances. In bioinformatics, the amazing progress in automated DNA sequencing has driven the science forward at a tremendous pace …

I agree with Steven and others that cheminformatics is not as “hot” as bioinformatics, based on varying metrics of hotness (groups, publications, funding, etc.). However, I think the perceived lack of popularity stems from a number of reasons and that technological pushes are a minor one. (Andrew Dalke noted some of these in a comment).

1. Lack of publicly accessible data – this has been mentioned in various places and I believe is a key reason that held back the development of cheminformatics outside industry. This is not to say that academic groups weren’t doing cheminformatics in the 1970s and 1980s, but we could’ve had a much richer ecosystem.

In this vein, it’s also important to note that public structure data alone, while necessary, would likely not have been sufficient for cheminformatics development. Rather, structure and biological activity data are both required for the development of novel cheminformatics methodologies. (Of course, certain aspects of cheminformatics are focused purely on chemical structure, and as such do fine in the absence of publicly accessible activity data).

2. Small molecules can make money directly – this is a primary driver for the previous point. A small molecule with confirmed activity against a target of interest can make somebody a lot of money. It’s not just that one molecule – analogs could be even more potent. As a result, holding back swathes of structure and activity data is the financially sensible approach. (Whether this is actually useful is open to debate). On the other hand, sequence data is rarely commercializable (though use of the sequence could be) and hence much easier to release.

3. Burden of knowledge – as I mentioned in my previous post, I believe that to make headway in many areas of cheminformatics requires some background in chemistry, since mathematical abstractions (cf. graph representations) only take you so far. As Andrew noted, “Bioinformatics has an ‘overarching mathematical theory’ because it’s based very directly on evolution, encoded in linear sequences”. As a result, the theoretical underpinnings of much of bioinformatics make it more accessible to the broader community of CS and mathematics. This is not to say that new mathematical developments are not possible in cheminformatics – it’s just a much more complex topic to tackle.

4. Lack of federal funding – this is really a function of the above three points. The idea that it’s all been done in industry is something I’ve heard before at meetings. Obviously, with poor or no federal funding opportunities, fewer groups see cheminformatics as a “rewarding” field. While I still think the NIH’s cancellation of the ECCR program was pretty dumb, this is not to say that there is no federal funding for cheminformatics. Applications just have to be appropriately spun.

To address Steven’s point regarding technology driving the science – I disagree. While large scale synthesis is possible in some ways (such as combinatorial libraries, diversity oriented synthesis etc.), just making large numbers of molecules is not really a solution. If it were, we might as well generate them virtually and work from the SMILES.

Instead, what is required is large scale activity measurements. And there have been technology developments that allow one to generate large amounts of structure-activity data – namely, High Throughput Screening (HTS) technologies. Admittedly, the data so generated is not near the scale of sequencing – but at the same time, compared to sequencing, every HTS project usually requires some form of unique optimization of assay conditions. Added to that, we’re usually looking at a complex system and not just a nucleotide sequence, and it’s easy to see why HTS assays are not going to be at the scale of next gen sequencing.

But much of this technology was relegated to industry. It’s only in the last few years that HTS technology has been accessible outside industry, and efforts such as the Molecular Libraries Initiative have made great strides in getting HTS technologies to academics and, more importantly, making the results of these screens publicly available.

As a bioinformatics bystander, while I see reports of next gen sequencing pushing out GBs and TBs of data and hence the need for new bioinformatics methods – I don’t see a whole lot of “new” bioinformatics. To me it seems that it’s just variations of putting together sequences faster – which seems a rather narrow area, if that’s all that is being pushed by these technological developments. (I have my asbestos underwear on, so feel free to flame)

Certainly, bioinformatics is helped by high profile projects such as the Human Genome Project and the more recent 1000 Genomes project, which have great gee-whiz factors. What might be an equivalent for cheminformatics? I’m not sure – but I’d guess something on the lines of systems biology or systems chemical biology might be a possibility.

Or maybe cheminformatics just needs to become “small molecule bioinformatics”?

Written by Rajarshi Guha

February 22nd, 2011 at 1:51 am

Cheminformatics – the New World for TCS?

A few weeks back Aaron Sterling posted a review of the Handbook of Cheminformatics Algorithms (in which I have a chapter). Aaron notes

… my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry …

The review/bridging was a pretty thorough summary of the book, but the blog post as well as the comments raised a number of interesting issues that I think are worth discussing. Aaron notes

… Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind …

While the HCA may be the first compilation of cheminformatics-related algorithms in a single place, cheminformatics actually has a pretty long lineage, starting back in the 1960s. Examples include canonicalization (Morgan, 1965) and ring perception (Hendrickson, 1961). See here for a short history of cheminformatics. Granted these are not CS journals, but that doesn’t mean that cheminformatics is a new field. Bioinformatics also seems to have a similar lineage (see this Biostar thread) with some seminal papers from the 1960s (Dayhoff et al, 1962). Interestingly, it seems that much of the most-cited literature (alignments etc.) in bioinformatics comes from the 1990s.

Aaron then goes on to note that “there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA”. In some ways this is correct – a number of cheminformatics topics could be considered ad-hoc, rather than grounded in rigorous mathematical proofs. But there are topics, primarily in the graph theoretical areas, that are pretty rigorous. I think Aaron’s choice of complexity descriptors as an example is not particularly useful – granted it is easy to understand without a background in cheminformatics, but from a practical perspective, complexity descriptors tend to have limited use, synthetic feasibility being one case. (Indeed, there is an ongoing argument about whether topological 2D descriptors are useful and much of the discussion depends on the context). All the points that Aaron notes are correct: induction on small examples, lack of a formal framework for comparison, limited explanation of the utility. Indeed, these comments can be applied to many cheminformatics research reports (cf. “my FANCY-METHOD model performed 5% better on this dataset” style papers).

But this brings me to my main point – many of the real problems addressed by cheminformatics cannot be completely (usefully) abstracted away from the underlying chemistry and biology. Yes, a proof of the lower bounds on the calculation of a molecular complexity descriptor is interesting; maybe it’d get you a paper in a TCS journal. However, it is of no use to a practising chemist in deciding what molecule to make next. The key thing is that one can certainly start with a chemical graph, but in the end it must be tied back to the actual chemical  & biological problem. There are certainly examples of this such as the evaluation of bounds on fingerprint similarity (Swamidass & Baldi, 2007). I believe that this stresses the need for real collaborations between TCS, cheminformatics and chemistry.

As another example, Aaron uses the similarity principle (Martin et al, 2002) to explain how cheminformatics measures similarity in different ways and the nature of problems tackled by cheminformatics. One anonymous commenter responds

… I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous …

In my opinion, the commenter has never worked on real chemical problems, or is of the belief that chemistry can be abstracted into some “pure” framework, divorced from reality. The fact of the matter is that, from a physical point of view, similar molecules do in many cases exhibit similar behaviors. Conversely, there are many cases where similar molecules exhibit significantly different behaviors (Maggiora, 2006). But this is reality and is what cheminformatics must address. In other words, cheminformatics in the absence of chemistry is just symbols on paper.
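To give non-chemists a feel for what "similar" usually means computationally, here is a minimal sketch of the Tanimoto coefficient, one common similarity measure on structural fingerprints. The fingerprints are represented as sets of "on" bit positions and are entirely made up; real ones would come from a cheminformatics toolkit, and the choice of fingerprint and measure is exactly the kind of context-dependent decision discussed above.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    if not fp_a and not fp_b:
        return 0.0  # undefined for two empty fingerprints; 0.0 by convention here
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Made-up fingerprints, each a set of "on" bit positions.
mol_a = {1, 4, 7, 9}
mol_b = {1, 4, 7, 12}  # differs from mol_a in one bit
mol_c = {2, 3}         # shares nothing with mol_a

print(tanimoto(mol_a, mol_b))  # 0.6
print(tanimoto(mol_a, mol_c))  # 0.0
```

A high score says only that two representations overlap; whether the molecules then behave similarly is the empirical question the similarity principle is about.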

Aaron, as well as a number of commenters, notes that one of the factors holding back cheminformatics is public access to data and tools. For data, this was indeed the case for a long time. But over the last 10 years or so, a number of large public access databases have become available. While one can certainly argue about the variability in data quality, things are much better than before. In terms of tools, open source cheminformatics tools are also relatively recent, from around 2000 or so. But, as I noted in the comment thread, there is a plethora of open source tools that one can use for most cheminformatics computations, and in some areas they are equivalent to commercial implementations.

My last point, which is conjecture on my part, is that one reason for the higher profile of bioinformatics in the CS community is that it has a relatively lower barrier to entry for a non-biologist (and I’ll note that this is likely not a core reason, but a reason nonetheless). After all, the bulk of bioinformatics revolves around strings. Sure, there are topics (protein structure etc.) that are more physical, and I don’t want to go down the semantic road of what is and what is not bioinformatics. But my experience as a faculty member in a department with both cheminformatics and bioinformatics suggests to me that, coming from a CS or math background, it is easier to get up to speed on the latter than the former. I believe that part of this is due to the fact that while both cheminformatics and bioinformatics are grounded in common, abstract data structures (sequences, graphs etc.), one very quickly runs into the nuances of chemical structure in cheminformatics. An alternative way to put it is that much of bioinformatics is based on a single data type – properties of sequences. On the other hand, cheminformatics has multiple data types (aka structural representations) and which one is best for a given task is not always apparent. (Steve Salzberg also made a comment on the higher profile of bioinformatics, which I’ll address in an upcoming post).

In summary, I think Aaron’s post was very useful as an attempt at bridge building between two communities. Some aspects could have been better articulated – but the fact is, CS topics have been a core part of cheminformatics for a long time and there are ample problems yet to be tackled.

Written by Rajarshi Guha

February 13th, 2011 at 7:45 pm

Working with Sequences in R

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. Our database contains a catalog for an siRNA library with ~27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

seqs <- get_sequences_from_db()
seqs <- sapply(seqs, function(x) {
    as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and its subclasses, DNAStringSet and RNAStringSet. Using these got me the siRNA sequences in less than a second.

seqs <- get_sequences_from_db()
seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))

A slightly contrived example shows the performance improvement

library(Biostrings)
x <- sapply(1:1000, function(x) {
    paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
# vectorized: convert the whole set of sequences at once
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
# per-sequence: one conversion per sapply iteration
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))))))

Ideally, my descriptor code would also operate directly on an RNAString object, rather than requiring a character object.
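For anyone without Bioconductor at hand, the two steps used above (DNA to RNA, then complement) amount to two character translations. The Python sketch below is only meant to show what the transformation does to a sequence; the function name is made up, it handles only the four unambiguous bases, and like `complement` it does not reverse the sequence.

```python
# Transcription as used above: replace T with U.
DNA_TO_RNA = str.maketrans("T", "U")
# Base-wise RNA complement: A<->U, C<->G.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def sirna_from_target(dna: str) -> str:
    """Convert a DNA target to RNA, then take the (non-reversed) complement."""
    rna = dna.upper().translate(DNA_TO_RNA)
    return rna.translate(RNA_COMPLEMENT)


print(sirna_from_target("ATCG"))  # UAGC
```

Here "ATCG" first becomes "AUCG" and then its complement "UAGC".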

Written by Rajarshi Guha

October 20th, 2010 at 10:11 pm

Posted in bioinformatics,software
