# So much to do, so little time

Trying to squeeze sense out of chemical data

## Exploring ChEMBL Targets with Neo4j

As part of an internal project I’ve recently started working with Neo4j for representing and querying relationships between entities (targets, compounds, etc.). What has really caught my attention is the Cypher graph query language – by allowing you to construct queries using graph notation, many tasks that would be complex or tedious in a traditonal RDBMS become much easier.

As an example, I loaded the ChEMBL target hierarchy and the targets as a graph. On it’s own it’s not particularly useful – the real utility arises when other datasets (and datatypes) are linked to the targets. But even at this stage, one can easily ask questions such as

### Find all kinase proteins

which is simply a matter of identifying proteins that have a direct path to the Kinase target class.

Assuming you have ChEMBL loaded in to a MySQL database, you can generate a Neo4j graph database containing the targets and classification hierarchy using code from the neo4jexpt repository. Simply compile and run as (appropriately changing host name, user and password)

 123 $mvn package$ java -Djdbc.url="jdbc:mysql://host.name/chembl_20?user=USER&password=PASS" \        -jar target/neo4j-ctl-1.0-SNAPSHOT.jar graph.db

Once complete, you should see a folder named graph.db. Using the Neo4j application you can then explore the graph in your browser by executing Cypher queries. For example, lets get the graph of the entire ChEMBL target classification hierarchy (and ensuring that we don’t include actual proteins)

 12 MATCH (n {TargetType:'TargetFamily'})-[r]-(m {TargetType:'TargetFamily'})   RETURN r

(The various annotations such as TargetType and TargetFamily are based on my code). When visualized we get

Lets get more specific, and extract the kinase portion of the classification hierarchy

 1234 MATCH (n {TargetType:'TargetFamily'}),       (m {TargetID:'Kinase'}),       p = shortestPath( (n)-[:ChildOf*]->(m) )   RETURN p

Given that we’ve linked the protein themselves to the target classes, we can now ask for all proteins that are kinases

 1234 MATCH (m {TargetType:'MolecularTarget'}),       (n {TargetID:'Kinase'}),       p = shortestPath( (m)-[*]->(n) )   RETURN m

Or identify the target classes that are linked to more than 25 proteins

 1234 MATCH ()-[r1:IsA]-(m:TargetBiology {TargetType:"TargetFamily"})   WITH m, COUNT(r1) AS relCount   WHERE relCount > 25   RETURN m

which gives us a table of target classes and counts, part of which is shown below

Overall this seems to be a very powerful platform to integrate data sources and types and effectively query for relationships. The browser based view is useful to practice Cypher and answer questions of the dataset. But a REST API is available as well as other tools such as Gremlin that allow for much more flexible applications and sophisticated queries.

Written by Rajarshi Guha

November 14th, 2015 at 6:10 pm

## Summarizing Collections of Curves

I was browsing live notes from the recent IEEE conference on visualization and came across a paper about functional boxplots. The idea is an extension of the boxplot visualization (shown alongside), to a set of functions. Intuitively, one can think of a functional box plot as specific envelopes for a set of functions. The construction of this plot is based on the notion of band depth (see the more general concept of data depth) which is a measure of how far a given function is from the collection of functions. As described in Sun & Genton the band depth for a given function can be computed by randomly selecting $$J$$ functions and identifying wether the given function is contained within the minimum and maximum of the $$J$$ functions. Repeating this multiple times, the fraction of times that the given function is fully contained within the $$J$$ random functions gives the band depth, $$BD_j$$. This is then used to order the functions, allowing one to compute a 50% band, analogous to the IQR in a traditional boxplot. There are more details (choice of $$J$$, partial bounding, etc.) described in the papers and links above.

My interest in this approach was piqued since one way of summarizing a dose response screen, or comparing dose response data across multiple conditions is to generate a box plot of a single curve fit parameter – say, $$\log IC_{50}$$. But if we wanted to consider the curves themselves, we have a few options. We could simply plot all of them, using translucency to avoid a blob. But this doesn’t scale visually. Another option, on the left, is to draw a series of box plots, one for each dose, and then optionally join the median of each boxplot giving a “median curve”. While these vary in their degree of utility, the idea of summarizing the distribution of a set of curves, and being able to compare these distributions is attractive. Functional box plots look like a way to do this. (A cool thing about functional boxplots is that they can be extended to multivariate functions such as surfaces and so on. See Mirzargar et al for examples)

Computing $$BD_j$$ can be time consuming if the number of curves is large or $$J$$ is large. Lopez-Pintado & Jornsten suggest a simple optimization to speed up this step, and for the special case of $$J = 2$$, Sun et al proposed a ranking based procedure that scales to thousands of curves. The latter is implemented in the fda package for R which also generates the final functional box plots.

As an example I considered 6 cell proliferation assays run in dose response, each one running the same set of compounds, but under different growth conditions. For each assay I only considered good quality curves (giving from 349 to 602 curves). The first plot compares the actives identified in the different growth conditions using the $$\log IC_{50}$$, and indicates a statistically significant increase in potency in the last three conditions compared to the first three.

In contrast, the functional box plots for the 6 assays, suggest a somewhat different picture (% Response = 100 corresponds to no cell kill and 0 corresponds to full cell kill).

The red dashed curves correspond to outliers and the blue lines correspond to the ‘maximum’ and ‘minimum’ curves (analogous to the whiskers of the traditional boxplot). Importantly, these are not measured curves, but instead correspond to the dose-wise maximum (and minimum) of the real curves. The pink region represents 50% of the curves and the black line represents the (virtual) median curve. In each case the X-axis corresponds to dose (unlabeled to save space). Personally, I think this visualization is a little cleaner than the dose-wise box plot shown above.

The mess of red lines in the plot 1 suggest an issue with the assay itself. While the other plots do show differences, it’s not clear what one can conclude from this. For example, in the plot for 4, the dip on the left hand side (i.e., low dose) could suggest that there is a degree of cytotoxicity, which is comparatively less in 3, 5 and 6. Interestingly none of the median curves are really sigmoidal, suggesting that the distribution of dose responses has substantial variance.

Written by Rajarshi Guha

November 30th, 2014 at 3:02 pm

## Exploring co-morbidities in medical case studies

A previous post described a first look at the data available in casesdatabase.com, primarily looking at summaries of high level meta-data. In this post I start looking at the cases themselves. As I noted previously, BMC has performed some form of biomedical entity recognition on the abstracts (?) of the case studies, resulting in a set of keywords for each case study. The keywords belong to specific types such as Condition, Medication and so on. The focus of this post will be to explore the occurrence of co-morbidities – which conditions occur together, to what extent and whether such occurrences are different from random. The code to extract the co-morbidity data and generate the analyses below is available in co-morbidity.py

Before doing any analyses we need to do some clean up of the Condition keywords. This includes normalizing terms (replacing ‘comatose’ with ‘coma’, converting all diabetes variants such as Type 1 and Type 2 to just diabetes), fixing spelling variants (replacing ‘foetal’ with ‘fetal’), removing stopwords and so on. The Python code to perform this clean up requires that we manually identify these transformations. I haven’t done this rigorously, so it’s not a totally cleansed dataset. The cleanup code looks like

 12345678910111213141516171819202122232425262728 def cleanTerms(terms):     repMap = {'comatose':'coma',               'seizures':'seizure',               'foetal':'fetal',               'haematomas':'Haematoma',               'disorders':'disorder',               'tumour':'tumor',               'abnormalities':'abnormality',               'tachycardias':'tachycardias',               'lymphomas': 'lymphoma',               'tuberculosis':'tuberculosis',               'hiv':'hiv',               'anaemia':'anemia',               'carcinoma':'carcinoma',               'metastases':'metastasis',               'metastatic':'metastasis',               '?':'-'}     stopwords = ['state','syndrome'', low grade', 'fever', 'type ii', 'mellitus', 'type 2', 'type 1', 'systemic', 'homogeneous', 'disease']     l = []     term = [x.lower().strip() for x in terms]     for term in terms:         for sw in stopwords: term = term.replace(sw, '')         for key in repMap.keys():             if term.find(key) >= 0: term = repMap[key]         term = term.encode("ascii", "ignore").replace('\n','').strip()         l.append(term)     l = filter(lambda x: x != '-', l)     return(list(set(l)))

Since each case study can be associated with multiple conditions, we generate a set of unique condition pairs for each case, and collect these for all 28K cases I downloaded previously.

 12345678910 cases = pickle.load(open('cases.pickle')) allpairs = [] for case in cases:     ## get all conditions for this case     conds = filter(lambda x: x['type'] == 'Condition', [x for x in case['keywords']])     conds = cleanTerms([x['text'] for x in conds])     if len(conds) == 0: continue     conds.sort()     pairs = [ (x,y) for x,y in list(combinations(conds, 2))]     allpairs.extend(pairs)

It turns out that across the whole dataset, there are a total of 991,466 pairs of conditions corresponding to 576,838 unique condition pairs and 25,590 unique conditions. Now, it’s clear that some condition pairs may be causally related (some of which are trivial cases such as cough and infection), whereas others are not. In addition, it is clear that some condition pairs are related in a semantic, rather than causal, fashion – carcinoma and cancer. In the current dataset we can’t differentiate between these classes. One possibility would be to code the conditions using ICD10 and collapse terms using the hierarchy.

Number of co-morbidities vs frequency of occurrence

Having said that, we work with what we currently have – and it’s quite sparse. In fact the 28K case studies represent just 0.16% of all possible co-morbidities. Within the set of just under 600K unique observed co-morbidities, the bulk occur just once. For the rest of the analysis we ignore these singleton co-morbidities (leaving us with 513,997  co-morbidities). It’s interesting to see the distribution of frequencies of co-morbidities. The first figure plots the number of co-morbidities that occur at least N times – 99,369 co-morbidities occur 2 or more times in the dataset and so on.

Another way to visualize the data is to plot a pairwise heatmap of conditions. For pairs of conditions that occur in the cases dataset we can calculate the probability of occurrence (i.e., number of times the pair occurs divided by the number of pairs). Furthermore, using a sampling procedure we can evaluate the number of times a given pair would be selected randomly from the pool of conditions. For the current analysis, I used 1e7 samples and evaluated the probability of a co-morbidity occurring by chance. If this probability is greater than the observed probability I label that co-morbidity as not different from random (i.e., insignificant). Ideally, I would evaluate a confidence interval or else evaluate the probability analytically (?).

For the figure below, I considered the 48 co-morbidities (corresponding to 25 unique conditions) that occurred 250 or more times in the dataset. I display the lower triangle for the heatmap – grey indicates no occurrences for a given co-morbidity and white X’s identify co-morbidities that have a non-zero probability of occurrence but are not different from random. As noted above, some of these pairs are not particularly informative – for example, tumor and metastasis occur with a relatively high probability, but this is not too surprising

Probability of occurrence for co-morbidities occurring more than 250 times

It’s pretty easy to modify co-morbidity.py to look at other sets of co-morbidities. Ideally, however, we would precompute probabilities for all co-morbidities and then support interactive visualization (maybe using D3).

It’s also interesting to look at co-morbidities that include a specific condition. For example, lets consider tuberculosis (and all variants). There are 948 unique co-morbidities that include tuberculosis as one of the conditions. While the bulk of them occur just twice, there are a number with relatively large frequencies of occurrence – lymphadenopathy co-occurs with tuberculosis 203 times. Rather than tabulate the co-occurring conditions, we can use the frequencies to generate a word cloud, as shown below. As with the co-morbidity heatmaps, this could be easily automated to support interactive exploration. On a related note, it’d be quite interesting to compare the frequencies discussed here with data extracted from a live EHR system

A visualization of conditions most frequently co-occurring with tuberculosis

So far this has been descriptive – given the size of the data, we should be able to try out some predictive models. Future posts will look at the possibilities of modeling the case studies dataset.

Written by Rajarshi Guha

October 12th, 2013 at 10:43 pm

## Exploring medical case studies

with one comment

I recently came across http://www.casesdatabase.com/ from BMC, a collection of more than 29,000 peer-reviewed case studies collected from a variety of journals. I’ve been increasingly interested in the possibilities of mining clinical data (inspired by impressive work from Atul Butte, Nigam Shah and others), so this seemed like a great resource to explore

The folks at BMC have provided a REST API, which is still in development – as a result, there’s no public documentation and it still has a few rough edges. However, thanks to help from Demitrakis Kavallierou, I was able to interact with the API and extract summary search information as well as 28,998 case studies as of Sept 23, 2013. I’ve made the code to extract case studies available as proc.py. Running this, gives you two sets of data.

1. A JSON file for each year between 2000 and 2014, containing the summary results for all cases in that year which includes a summary view of the case, plus facets for a variety of fields (age, condition, pathogen, medication, intervention etc.)
2. A pickle file containing the case reports, as a list of maps. The case report contains the full abstract, case report identifier and publication meta-data.

A key feature of the case report entries is that BMC has performed some form of entity recognition so that it provides a list of keywords identified by different types: ‘Condition’, ‘Symptom’, ‘Medication’ etc. Each case may have multiple occurences for each type of keyword and importantly, each keyword is associated with the text fragment it is extracted from. As an example consider case 10.1136/bcr.02.2009.1548. The entry extracts two conditions

 123 {u'sentence': u'She was treated by her family physician for presumptive interscapular myositis with anti-inflammatory drugs, cold packs and rest.', u'text': u'Myositis', u'type': u'Condition'}

and

 123 {u'sentence': u'The patient denied any constitutional symptoms and had no cough.', u'text': u'Cough', u'type': u'Condition'}

I’m no expert in biomedical entity recognition, but the fact that BMC has performed it, saves me from having to become one, allowing me to dig into the data. But there are the usual caveats associated with text mining – spelling variants, term variants (insulin and insulin therapy are probably equivalent) and so on.

Count of cases deposited per year

However, before digging into the cases themselves, we can use the summary data, and especially the facet information (which is, by definition, standardized) to get some quick summaries from the database. For example we see the a steady increase of case studies deposited in the literature over the last decade or so.

Interestingly, the number of unique conditions, medications or pathogens reported for these case studies is more or less constant, though there seems to be a downward trend for conditions. The second graph highlights this trend, by plotting the number of unique facet terms (for three types of facets) per year, normalized by the number of cases deposited that year.

Normalized count of unique facet terms by year

This is a rough count, since I didn’t do any clean up of the text – so that misspellings of the same term (say, acetaminophen and acetaminaphen will be counted as two separate medication facets) may occur.

Another interesting task would be to enrich the dataset with additional annotations – ICD9/ICD10 for conditions, ATC for drugs – which would allow a higher level categorization and linking of case studies. In addition, one could use the CSLS service to convert medication names to chemical structures and employ structural similarity to group case studies.

The database also records some geographical information for each case. Specifically, it lists the countries that the authors are from. While interesting to an extent, it would have been nice if the country of occurrence or country of treatment were specifically extracted from the text. Currently, one might infer that the treatment occurred in the same country as the author is from, but this is likely only true when all authors are from the same country. Certainly, multinational collaborations will hide the true number of cases occurring in a given country (especially so for tropical diseases).

But we can take a look at how the number of cases reported for specific conditions, varies with geography and time. The figure below shows the cases whose conditions included the term tuberculosis

Tuberculosis cases by country and year

The code to extract the data from the pickle file is in condition_country.py. Assuming you have cases.pickle in your current path, usage is

 1 \$ python condition_country.py condition_name

and will output the data into a CSV file, which you can the process using your favorite tools.

In following blog posts, I’ll start looking at the actual case studies themselves. Interesting things to look at include exploring the propensity of co-morbidities, analysing the co-occurrence of conditions and medications or conditions and pathogens, to see whether the set of treatments associated with a given condition (or pathogen) has changed over time. Both these naturally lead to looking at the data with eye towards repurposing events.

Written by Rajarshi Guha

October 10th, 2013 at 7:20 pm

## Notes & thoughts from the IU semantics workshop

Over the last two days I attended a workshop titled Exploiting Big Data Semantics for Translational Medicine, held at Indiana University, organized by David Wild, Ying Ding, Katy Borner and Eric Gifford. The stated goals were to explore advances in translation medicine via data and semantic technologies, with a view towards possible fundable ideas and funding opportunites. A nicely arranged workshop that was pretty intense – minimal breaks, constant thinking – which is a good use of 2 days. As you can see from the workshop website, the attendees brought a variety of skills and outlooks to the meeting. For me this was one of the most attractive features of the workshop.

This post is a rough dump of some observations & thoughts during the workshop – I’m sure I’ve left out important comments, provide minimal attribution and I assume there will be a more thorough report coming out from the organizers. I also point out that I am an interested bystander to this field and somewhat of a semantic web/technology (SW/T) skeptic – so some views may be naive or just wrong. I like the ideas and concepts, I can see their value, but I have not been convinced to invest significant time and efort into “semantifying” my day to day work. A major motivation for my attending this workshop was to learn what the experts are doing and see how I could incorporate some of these ideas into my own work.

## The Meeting

The first day started with 5 minute introductions which was quite useful and great overview talks by three of the attendees. With the information dump, a major focus of the day was a discussion of opportunities and challenges. This was a very useful session with attendees listing specific instances of challenges, opportunities, bottlenecks and so on. I was able to take some notes on the challenges, including

• Funding – lack of it and difficulty in obtaining it (i.e., persuading funders)
• Cultural and social issues around semantic approaches (e.g., why change what’s already working? etc)
• Data problems such as errors being propagated through ontologies and semantic conversion processes etc (I wonder to what extent this is a result of automated conversion processes such as D2R, versus manual errors introduced during curation. I suspect a mix of both)
• Hilbert Problems” – a very nice term coined by Katy to represent grand challenges or open problems that could serve as seeds around which the community could nucleate. (This aspect was of particular interest as I have found it difficult to identify compelling life science use cases that justify a retooling (even partial) of current workflows.)

The second day focused on breakout sessions, based on the opportunities and challenges listed the day before. Some notes on some of these sessions:

Bridging molecular data and clinical data – this session focused on challenges and opportunities in using molecular data together with clinical data to inform clinical decision making. Three broad opportunies came out of this, viz., Advancing understanding of disease conditions, Optimizing data types/measurements for clinical decision making outcomes and Drug repurposing. Certainly very broad goals, and not particularly focused on SW/T. My impression that SW/T can play an important role in standardization and optimization of coding standards to more easily and robustly connect molecular and clinical data sources. But one certainly needn’t invoke SW/T to address these opportunities

Knowledge discovery – the considerations addressed by this group included the fact that semantified data (vocabularies, ontologies etc) is increasing in volume and availability, tools are available to go from raw data to semantified forms and so on. An important point was made the quality is a key consideration at multiple levels – the raw data, the semantic representation and the links between semantic entities. A challenge identified by this group was to identify use cases that SW/T can resolve and traditional technologies cannot.

RDBMS vs semantic databases – this was an interesting session that tried to address the question of when one type of database is better than the other. It seems that the consensus was that certain problems are better suited for one type over another and a hybrid solution is usually a sensible approach – but that goes without saying. A comment was made that certain classes of problems that involve identifying paths between terms (nodes) are better suited for semantic (graph) databases – this makes intuitive sense, but there was a consensus that there weren’t any realistic applications that one could point to. I like the idea – have attributes in a RDBMS, but links in a graph database and use graph queries to identify relations and entities that are then mapped to the RDBMS. My concern with this is that path traversals are easy (Neo4J does this quite efficiently) – the problem is in the explosion of possible paths between nodes and the fact that the majority of them are trivial at best or nonsensical at worst. This suggests that relevance/ranking is a concern in semantic/graph databases.

The session of most interest to me was that of grand challenges. I think we got to 5  or 6 major challenges

• How to represent knowledge (methods for, evaluation of)
• How do changes in ontologies affect scientific research over time
• How to construct an ontology from a set of ontologies (i.e. preexisting knowledge) that is better than the individual ones (and so links to how to evaluate an ontology in terms of “goodness”)
• Error propagation from measurements to representation to analysis
• Visualization of multi-dimensional / high dimensional data – while a general challenge, I think it’s correct that visual representations of semantified data (and their supporting infrastructure such as ontologies) could make the methods and tools much more accessible. Would’ve been nice if we had more discussion on this aspect

We finally ended with a discussion concrete projects that attendees would be interested in collaborating on and this was quite fruitful.

## My Opinion

It turns out that a good chunk of the discussion focused on translational medicine (clinical informatics, drug repurposing etc.) and the use of different data types to enable life science research, but largely independent of SW/T. Indeed, the role of SW/T seemed rather fuzzy at times – to some extent, a useful tool, but not indispensable. My impression was that much of the SW/T that was discussed really focused on labeling of knowledge via ontologies and making links between datasets and the challenges faced during these operations (which is fine and important – but does it justify funding?).

I certainly got some conflicting views of the state of the art. Comments from Amit Sheth made it appear that SW/T is well established and the main problems are solved, based on deployed applications in the “enterprise”. But comments from many of the attendes working in life sciences suggested many problems in dealing and working with semantic data. Sure, Google has it’s Knowledge Graph and other search engines are employing SW/T under the hood. But if it’s so well established, where are the products, tools and workflows that an informatics-savvy non-expert in SW/T can employ? Does this mean research funding is not really required and it’s more of a productization/monetization issue? Or is this a domain specific issue – what works for general search doesn’t necessarily work in the life sciences?

My fundamental issue is the absence of a “killer application” – an application or use case that gives a non-trivial result, that could not be achieved via traditional means. (I qualify this, by asking for such use cases in life sciences. Maybe bankers have already found their killer applications). Depending on the semantic technology one considers there are partial answers: ontologies are an example of such a use case, when used to enable linkages between datasets and sources across domains. To me this makes perfect sense (and is of particular interest and use in current projects such as BARD). But surely, there must be more than designing ontologies and annotating data with ontological terms? One of the things that was surprising to me was that some of the future problems that were considered for possible collaborations were not really dependent on SW/T – in other words, they could largely be addressed via pre-existing methodologies.

My (admittedly cursory) reading of the SW/T literature seems to suggest to me that a major promise of this field is “reasoning” over my data. And I’m waiting for non-trivial assertions made based on linked data, ontologies and so on – that really highlight where my SQL tables will fail. It’s not sufficient (to me) to say that what took me 50 lines of Python code takes you 2 lines of SPARQL – I have an investment made in my RDBMS, API’s and codebase and yes, it takes a bit more fiddling – but I can get my answer in 5 minutes because it’s already been set up.

Some points were made regarding challenges faced by SW/T including complexity of OWL, difficulty in leaning SPARQL, poor performanec queries. Personally, these are not valid challenges and I certainlly do not make the claim that tricky SPARQL queries are preventing me from jumping into SW/T. I’m perfectly willing to wait 5 min for a SPARQL query to run, if the outcome is of sufficient value. The bigger issue for me is the value of the outcomes – maybe it’s just too early for truly novel, transformative results to be produced. Or maybe it’s simply one tool amongst others that can be used to tackle a certain class of problems.

Overall, it was a worthwhile two days interacting with a group of interesting people. But definitely some fuzziness in terms of what role SW/T can, should or will play in translational life science research.

Written by Rajarshi Guha

March 27th, 2013 at 7:26 pm