So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘bioinformatics’ Category

Exploring medical case studies

with one comment

I recently came across http://www.casesdatabase.com/ from BMC, a collection of more than 29,000 peer-reviewed case studies collected from a variety of journals. I’ve been increasingly interested in the possibilities of mining clinical data (inspired by impressive work from Atul Butte, Nigam Shah and others), so this seemed like a great resource to explore.

The folks at BMC have provided a REST API, which is still in development – as a result, there’s no public documentation and it still has a few rough edges. However, thanks to help from Demitrakis Kavallierou, I was able to interact with the API and extract summary search information as well as 28,998 case studies as of Sept 23, 2013. I’ve made the code to extract case studies available as proc.py. Running it gives you two sets of data:

  1. A JSON file for each year between 2000 and 2014, containing the summary results for all cases in that year which includes a summary view of the case, plus facets for a variety of fields (age, condition, pathogen, medication, intervention etc.)
  2. A pickle file containing the case reports, as a list of maps. The case report contains the full abstract, case report identifier and publication meta-data.
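As a minimal sketch of how the second output could be read back in, assuming the file is named cases.pickle and holds a plain list of dicts. The helper name load_cases is hypothetical, and the encoding argument is only a precaution for pickles written under Python 2:

```python
import pickle


def load_cases(path="cases.pickle"):
    """Load the list of case-report dicts from a pickle file.

    Only unpickle files you created yourself; pickle can run arbitrary code.
    """
    # encoding="latin1" helps Python 3 read pickles written by Python 2
    with open(path, "rb") as fh:
        return pickle.load(fh, encoding="latin1")
```

Calling load_cases() with no argument looks for cases.pickle in the current directory.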

A key feature of the case report entries is that BMC has performed some form of entity recognition, so that each entry provides a list of keywords identified by type: ‘Condition’, ‘Symptom’, ‘Medication’ etc. Each case may have multiple occurrences of each type of keyword and, importantly, each keyword is associated with the text fragment it was extracted from. As an example, consider case 10.1136/bcr.02.2009.1548. The entry extracts two conditions:

{u'sentence': u'She was treated by her family physician for presumptive interscapular myositis with anti-inflammatory drugs, cold packs and rest.',
u'text': u'Myositis',
u'type': u'Condition'}

and

{u'sentence': u'The patient denied any constitutional symptoms and had no cough.',
u'text': u'Cough',
u'type': u'Condition'}
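To make the keyword structure concrete, here is a small sketch that groups keyword entries by their type. The two entries are the ones shown above; the function name terms_by_type and the lowercasing step are my own assumptions, not part of the BMC data or API:

```python
from collections import defaultdict

# The two keyword entries shown above, for case 10.1136/bcr.02.2009.1548
keywords = [
    {"sentence": "She was treated by her family physician for presumptive "
                 "interscapular myositis with anti-inflammatory drugs, "
                 "cold packs and rest.",
     "text": "Myositis",
     "type": "Condition"},
    {"sentence": "The patient denied any constitutional symptoms and had no cough.",
     "text": "Cough",
     "type": "Condition"},
]


def terms_by_type(keywords):
    """Map each keyword type to the set of (lowercased) terms of that type."""
    grouped = defaultdict(set)
    for kw in keywords:
        grouped[kw["type"]].add(kw["text"].strip().lower())
    return dict(grouped)


print(terms_by_type(keywords))
```

Keeping the sentence alongside each term means you can always go back and check the context a keyword was pulled from.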

I’m no expert in biomedical entity recognition, but the fact that BMC has performed it saves me from having to become one, allowing me to dig into the data. But there are the usual caveats associated with text mining – spelling variants, term variants (insulin and insulin therapy are probably equivalent) and so on.

Count of cases deposited per year


However, before digging into the cases themselves, we can use the summary data, and especially the facet information (which is, by definition, standardized) to get some quick summaries from the database. For example, we see a steady increase in case studies deposited in the literature over the last decade or so.

Interestingly, the number of unique conditions, medications or pathogens reported for these case studies is more or less constant, though there seems to be a downward trend for conditions. The second graph highlights this trend, by plotting the number of unique facet terms (for three types of facets) per year, normalized by the number of cases deposited that year.

Normalized count of unique facet terms by year
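The normalization behind this plot is simple enough to sketch. The input layout here (a dict of facet terms per year and a dict of case counts per year) and the numbers are made up for illustration; the real JSON summaries will need some unpacking first:

```python
def normalized_unique_terms(facets_by_year, cases_by_year):
    """Return {year: number of unique facet terms / cases deposited that year}."""
    normalized = {}
    for year, terms in facets_by_year.items():
        n_cases = cases_by_year.get(year, 0)
        if n_cases:  # skip years with no deposited cases
            normalized[year] = len(set(terms)) / n_cases
    return normalized


# Made-up values, for illustration only
facets = {
    2010: ["asthma", "tuberculosis", "asthma"],
    2011: ["asthma", "sepsis", "malaria"],
}
cases = {2010: 4, 2011: 12}
print(normalized_unique_terms(facets, cases))
```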


This is a rough count, since I didn’t do any clean up of the text – so misspellings of the same term may occur (say, acetaminophen and acetaminaphen, which will be counted as two separate medication facets).
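One rough way to catch such misspellings is to group terms by string similarity. This is only a sketch using Python's standard difflib module; the function name group_variants and the 0.9 cutoff are assumptions that would need tuning against the real facet lists:

```python
from difflib import SequenceMatcher


def group_variants(terms, cutoff=0.9):
    """Greedily group terms whose similarity to a group's first member >= cutoff."""
    groups = []
    for term in sorted({t.lower() for t in terms}):
        for group in groups:
            if SequenceMatcher(None, term, group[0]).ratio() >= cutoff:
                group.append(term)
                break
        else:
            # no existing group was close enough, so start a new one
            groups.append([term])
    return groups


print(group_variants(["acetaminophen", "acetaminaphen", "insulin"]))
```

Note that this only helps with spelling variants. Term variants such as insulin and insulin therapy are too different as strings to be merged at this cutoff, and would need a synonym list or an ontology lookup instead.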

Another interesting task would be to enrich the dataset with additional annotations - ICD9/ICD10 for conditions, ATC for drugs – which would allow a higher level categorization and linking of case studies. In addition, one could use the CSLS service to convert medication names to chemical structures and employ structural similarity to group case studies.

The database also records some geographical information for each case. Specifically, it lists the countries that the authors are from. While interesting to an extent, it would have been nice if the country of occurrence or country of treatment were specifically extracted from the text. Currently, one might infer that the treatment occurred in the same country as the author is from, but this is likely only true when all authors are from the same country. Certainly, multinational collaborations will hide the true number of cases occurring in a given country (especially so for tropical diseases).

But we can take a look at how the number of cases reported for specific conditions varies with geography and time. The figure below shows the cases whose conditions included the term tuberculosis.

Tuberculosis cases by country and year


The code to extract the data from the pickle file is in condition_country.py. Assuming you have cases.pickle in your current path, usage is

$ python condition_country.py condition_name

and will output the data into a CSV file, which you can then process using your favorite tools.
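For readers who want to roll their own version, the counting step might look roughly like the sketch below. This is not the code in condition_country.py; the case field names ("keywords", "countries", "year") and both function names are hypothetical and would need to be matched to the actual structure of the pickled cases:

```python
import csv
import sys
from collections import Counter


def count_by_country_year(cases, condition):
    """Count cases mentioning `condition`, keyed by (author country, year)."""
    needle = condition.lower()
    counts = Counter()
    for case in cases:
        has_condition = any(
            kw["type"] == "Condition" and needle in kw["text"].lower()
            for kw in case.get("keywords", [])  # hypothetical field name
        )
        if has_condition:
            # a case with authors from several countries counts once per country
            for country in case.get("countries", []):  # hypothetical field name
                counts[(country, case["year"])] += 1  # hypothetical field name
    return counts


def write_csv(counts, out=sys.stdout):
    """Write the counts as country,year,n_cases rows."""
    writer = csv.writer(out)
    writer.writerow(["country", "year", "n_cases"])
    for (country, year), n in sorted(counts.items()):
        writer.writerow([country, year, n])
```

Because the country is that of the authors, a multinational case is counted once for each author country here, which is exactly the caveat discussed above.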

In following blog posts, I’ll start looking at the actual case studies themselves. Interesting things to look at include exploring the propensity of co-morbidities, analysing the co-occurrence of conditions and medications or conditions and pathogens, to see whether the set of treatments associated with a given condition (or pathogen) has changed over time. Both of these naturally lead to looking at the data with an eye towards repurposing events.

Written by Rajarshi Guha

October 10th, 2013 at 7:20 pm

Cheminformatics and Hotness (or lack thereof)

with 2 comments

A few days back I discussed some thoughts on cheminformatics vis a vis bioinformatics, inspired by a review by Aaron Sterling. In that thread, Steven Salzberg made a comment, stating

… In my opinion, it is not a “hot” field, though, in part for some of the reasons mentioned in the post – particularly the fact that the data in the field is mostly proprietary and/or secret. So they hurt themselves by that behavior. But the other reason I don’t think it is moving that fast is that, unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological advances. In bioinformatics, the amazing progress in automated DNA sequencing has driven the science forward at a tremendous pace …

I agree with Steven and others that cheminformatics is not as “hot” as bioinformatics, based on varying metrics of hotness (groups, publications, funding, etc.). However, I think the perceived lack of popularity stems from a number of reasons and that technological pushes are a minor one. (Andrew Dalke noted some of these in a comment).

1. Lack of publicly accessible data – this has been mentioned in various places and I believe is a key reason that held back the development of cheminformatics outside industry. This is not to say that academic groups weren’t doing cheminformatics in the 70s and 80s, but we could’ve had a much richer ecosystem.

In this vein, it’s also important to note that just public structure data, while necessary, would likely not have been sufficient for cheminformatics development. Rather, structure and biological activity data are both required for the development of novel cheminformatics methodologies. (Of course certain aspects of cheminformatics are focused purely on chemical structure, and as such do fine in the absence of publicly accessible activity data).

2. Small molecules can make money directly – this is a primary driver for the previous point. A small molecule with confirmed activity against a target of interest can make somebody a lot of money. It’s not just that one molecule – analogs could be even more potent. As a result, holding back swathes of structure and activity data is the financially sensible approach. (Whether this is actually useful is open to debate). On the other hand, sequence data is rarely commercializable (though use of the sequence could be) and hence much easier to release.

3. Burden of knowledge – as I mentioned in my previous post, I believe that to make headway in many areas of cheminformatics requires some background in chemistry, since mathematical abstractions (cf graph representations) only take you so far. As Andrew noted, “Bioinformatics has an ‘overarching mathematical theory’ because it’s based very directly on evolution, encoded in linear sequences”. As a result the theoretical underpinnings of much of bioinformatics make it more accessible to the broader community of CS and mathematics. This is not to say that new mathematical developments are not possible in cheminformatics – it’s just a much more complex topic to tackle.

4. Lack of federal funding – this is really a function of the above three points. The idea that it’s all been done in industry is something I’ve heard before at meetings. Obviously, with poor or no federal funding opportunities, fewer groups see cheminformatics as a “rewarding” field. While I still think the NIH’s cancellation of the ECCR program was pretty dumb, this is not to say that there is no federal funding for cheminformatics. Applications just have to be appropriately spun.

To address Steven’s point regarding technology driving the science – I disagree. While large scale synthesis is possible in some ways (such as combinatorial libraries, diversity oriented synthesis etc.), just making large numbers of molecules is not really a solution. If it were, we might as well generate them virtually and work from the SMILES.

Instead, what is required is large scale activity measurements. And there have been technology developments that allow one to generate large amounts of structure-activity data – namely, High Throughput Screening (HTS) technologies. Admittedly, the data so generated is not near the scale of sequencing – but at the same time, compared to sequencing, every HTS project usually requires some form of unique optimization of assay conditions. Added to that, we’re usually looking at a complex system and not just a nucleotide sequence, and it’s easy to see why HTS assays are not going to be at the scale of next gen sequencing.

But much of this technology was relegated to industry. It’s only in the last few years that HTS technology has been accessible outside industry, and efforts such as the Molecular Libraries Initiative have made great strides in getting HTS technologies to academics and, more importantly, making the results of these screens publicly available.

As a bioinformatics bystander, while I see reports of next gen sequencing pushing out GBs and TBs of data and hence the need for new bioinformatics methods – I don’t see a whole lot of “new” bioinformatics. To me it seems that it’s just variations of putting together sequences faster – which seems a rather narrow area, if that’s all that is being pushed by these technological developments. (I have my asbestos underwear on, so feel free to flame.)

Certainly, bioinformatics is helped by high profile projects such as the Human Genome Project and the more recent 1000 Genomes project, which have great gee-whiz factors. What might be an equivalent for cheminformatics? I’m not sure – but I’d guess something on the lines of systems biology or systems chemical biology might be a possibility.

Or maybe cheminformatics just needs to become “small molecule bioinformatics”?

Written by Rajarshi Guha

February 22nd, 2011 at 1:51 am

Cheminformatics – the New World for TCS?

with 4 comments

A few weeks back Aaron Sterling posted a review of the Handbook of Cheminformatics Algorithms (in which I have a chapter). Aaron notes

… my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry …

The review/bridging was a pretty thorough summary of the book, but the blog post as well as the comments raised a number of interesting issues that I think are worth discussing. Aaron notes

… Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind …

While the HCA may be the first compilation of cheminformatics-related algorithms in a single place, cheminformatics actually has a pretty long lineage, starting back in the 1960s. Examples include canonicalization (Morgan, 1965) and ring perception (Hendrickson, 1961). See here for a short history of cheminformatics. Granted these are not CS journals, but that doesn’t mean that cheminformatics is a new field. Bioinformatics also seems to have a similar lineage (see this Biostar thread) with some seminal papers from the 1960s (Dayhoff et al, 1962). Interestingly, it seems that much of the most-cited literature (alignments etc.) in bioinformatics comes from the 90s.

Aaron then goes on to note that “there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA”. In some ways this is correct – a number of cheminformatics topics could be considered ad-hoc, rather than grounded in rigorous mathematical proofs. But there are topics, primarily in the graph theoretical areas, that are pretty rigorous. I think Aaron’s choice of complexity descriptors as an example is not particularly useful – granted it is easy to understand without a background in cheminformatics, but from a practical perspective, complexity descriptors tend to have limited use, synthetic feasibility being one case. (Indeed, there is an ongoing argument about whether topological 2D descriptors are useful and much of the discussion depends on the context). All the points that Aaron notes are correct: induction on small examples, lack of a formal framework for comparison, limited explanation of the utility. Indeed, these comments can be applied to many cheminformatics research reports (cf. “my FANCY-METHOD model performed 5% better on this dataset” style papers).

But this brings me to my main point – many of the real problems addressed by cheminformatics cannot be completely (usefully) abstracted away from the underlying chemistry and biology. Yes, a proof of the lower bounds on the calculation of a molecular complexity descriptor is interesting; maybe it’d get you a paper in a TCS journal. However, it is of no use to a practising chemist in deciding what molecule to make next. The key thing is that one can certainly start with a chemical graph, but in the end it must be tied back to the actual chemical & biological problem. There are certainly examples of this such as the evaluation of bounds on fingerprint similarity (Swamidass & Baldi, 2007). I believe that this stresses the need for real collaborations between TCS, cheminformatics and chemistry.
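To give a flavor of what a bound on fingerprint similarity buys you in practice, here is a minimal sketch of the simple bit-count bound on Tanimoto similarity: two fingerprints with A and B bits set cannot have a similarity above min(A, B) / max(A, B), so many candidates can be skipped without computing the exact value. Representing fingerprints as Python sets of on-bit indices is an assumption made for readability, and the function names are hypothetical:

```python
def tanimoto(a, b):
    """Exact Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_upper_bound(n_a, n_b):
    """Upper bound on Tanimoto similarity given only the two bit counts."""
    largest = max(n_a, n_b)
    return min(n_a, n_b) / largest if largest else 1.0


def search(query, library, threshold):
    """Return (name, similarity) pairs at or above threshold, pruning by bit count."""
    hits = []
    for name, fp in library.items():
        if tanimoto_upper_bound(len(query), len(fp)) < threshold:
            continue  # cannot reach the threshold, skip the exact calculation
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((name, sim))
    return hits
```

The pruning never changes the result of the search, it only avoids work, which is what makes this kind of result useful to someone screening a large library.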

As another example, Aaron uses the similarity principle (Martin et al, 2002) to explain how cheminformatics measures similarity in different ways and the nature of problems tackled by cheminformatics. One anonymous commenter responds

… I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous …

In my opinion, the commenter has never worked on real chemical problems, or is of the belief that chemistry can be abstracted into some “pure” framework, divorced from reality. The fact of the matter is that, from a physical point of view, similar molecules do in many cases exhibit similar behaviors. Conversely, there are many cases where similar molecules exhibit significantly different behaviors (Maggiora, 2006). But this is reality and is what cheminformatics must address. In other words, cheminformatics in the absence of chemistry is just symbols on paper.

Aaron, as well as a number of commenters, notes that one of the reasons holding back cheminformatics is public access to data and tools. For data, this was indeed the case for a long time. But over the last 10 years or so, a number of large public access databases have become available. While one can certainly argue about the variability in data quality, things are much better than before. In terms of tools, open source cheminformatics tools are also relatively recent, from around 2000 or so. But, as I noted in the comment thread, there is a plethora of open source tools that one can use for most cheminformatics computations, and that in some areas are equivalent to commercial implementations.

My last point, which is conjecture on my part, is that one reason for the higher profile of bioinformatics in the CS community is that it has a relatively lower barrier to entry for a non-biologist (and I’ll note that this is likely not a core reason, but a reason nonetheless). After all, the bulk of bioinformatics revolves around strings. Sure there are topics (protein structure etc) that are more physical and I don’t want to go down the semantic road of what is and what is not bioinformatics. But my experience as a faculty member in a department with both cheminformatics and bioinformatics suggests to me that, coming from a CS or math background, it is easier to get up to speed on the latter than the former. I believe that part of this is due to the fact that while both cheminformatics and bioinformatics are grounded in common, abstract data structures (sequences, graphs etc), one very quickly runs into the nuances of chemical structure in cheminformatics. An alternative way to put it is that much of bioinformatics is based on a single data type – properties of sequences. On the other hand, cheminformatics has multiple data types (aka structural representations) and which one is best for a given task is not always apparent. (Steve Salzberg also made a comment on the higher profile of bioinformatics, which I’ll address in an upcoming post).

In summary, I think Aaron’s post was very useful as an attempt at bridge building between two communities. Some aspects could have been better articulated – but the fact is, CS topics have been a core part of cheminformatics for a long time and there are ample problems yet to be tackled.

Written by Rajarshi Guha

February 13th, 2011 at 7:45 pm

Working with Sequences in R

with one comment

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. Our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

seqs <- get_sequences_from_db()
seqs <- sapply(seqs, function(x) {
  as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and its subclasses, DNAStringSet and RNAStringSet. Using these got me the siRNA sequences in less than a second.

seqs <- get_sequences_from_db()
seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))

A slightly contrived example shows the performance improvement

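library(Biostrings)

# 1,000 random 21-mer DNA sequences
x <- sapply(1:1000, function(i) {
    paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
# vectorized: convert and complement the whole set at once
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
# element-wise: build one string object per sequence
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))))))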

Ideally, my descriptor code would also operate directly on an RNAString object, rather than requiring a character object.
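For anyone following along outside of R, the transformation itself (DNA to RNA, then complement, with no reversal) is easy to sketch in plain Python. This is only an illustration of the two steps, not a replacement for Biostrings; the function name sirna_from_target is hypothetical and the sketch handles only the four unambiguous bases:

```python
# Translation tables for the two steps
DNA_TO_RNA = str.maketrans("T", "U")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def sirna_from_target(dna):
    """Convert a DNA sequence to RNA (T -> U), then take the RNA complement."""
    rna = dna.upper().translate(DNA_TO_RNA)
    return rna.translate(RNA_COMPLEMENT)


print(sirna_from_target("ATCG"))
```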

Written by Rajarshi Guha

October 20th, 2010 at 10:11 pm

Posted in bioinformatics,software


RNAi in PubChem

without comments

While considering ways to disseminate RNAi screening data, I found out that PubChem now contains two RNAi screening datasets – AIDs 1622 and 1904. These screens reuse the PubChem bioassay formats – which is both good and bad. For example, while there are a few standardized columns (such as PUBCHEM_ACTIVITY_SCORE), the bulk of the user deposited columns are not formally defined. In other words, you’d have to read the assay description. While not a huge deal, it would be nice if we could use pre-existing formats such as MIARE, analogous to MIAME for microarray data. That way we could determine the number of replicates, normalization method employed and other details of the screen. As far as I can tell, not all aspects of an RNAi screen are fully defined yet in the MIAME vocabulary, and there don’t seem to be a whole lot of examples. But it’s a start.

But of course, nothing is perfect. Why, oh why, would a tab delimited format be contained within multiple worksheets of an Excel workbook!

Written by Rajarshi Guha

April 19th, 2010 at 12:19 am

Posted in bioinformatics
