## Archive for the ‘research’ Category

## Accessing Chemistry on the Web Using Firefox

With the profusion of chemical information on the web – chemical names, images of structures, specific codes (InChI and so on) – it’s often useful to be able to seamlessly retrieve extra information while browsing a page that contains such entities. The usual way is to copy the InChI/SMILES/CAS/name string and paste it into PubChem, ChemSpider and so on.

However, a much smoother way is now available via a Firefox extension called NCATSFind, developed by my colleague. It’s a one-click install and, once installed, it automatically identifies a variety of chemical ID codes (CAS numbers, InChIs, UNIIs) and uses a variety of backend services to provide context for the entities it recognizes. In addition, it has a cool feature that lets you select an image and generate a structure from it (using OSRA in the background).

Check out his blog post for more details.

## Retrieving Target Classifications from ChEMBL

There are a number of scenarios where it’s useful to be able to classify protein targets – high-level summaries, enrichment calculations and so on. There are a variety of protein classification schemes out there, such as PANTHER, SCOP and InterPro, which are based on domains and other structural features. ChEMBL provides its own hierarchical classification. Since I use this from time to time, it’s useful to pull all the classifications for a given species in one go via the SQL below (tested with v17):

```sql
SELECT td.pref_name, description, accession, pfc.*
  FROM target_dictionary td,
       target_components tc,
       component_sequences cs,
       component_class cc,
       protein_family_classification pfc
 WHERE td.tax_id = 9606
   AND td.tid = tc.tid
   AND tc.component_id = cs.component_id
   AND cc.component_id = cs.component_id
   AND pfc.protein_class_id = cc.protein_class_id;
```

## Which Datasets Lead to Predictive Models?

I came across a recent paper from the Tropsha group that discusses the issue of *modelability* – that is, can a dataset (represented as a set of computed descriptors and an experimental endpoint) be reliably modeled? Obviously the definition of *reliable* is key here, and the authors focus on cross-validated classification accuracy as the measure of reliability. Furthermore, they focus on binary classification. This leads to a simple definition of modelability – for each data point, identify whether its nearest neighbor is in the same class as the data point. Then the ratio of the number of observations whose nearest neighbor is in the same activity class to the number of observations in that activity class, averaged over all classes, gives the MODI score. Essentially this is a statement about class separability within a given representation.
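To make the definition concrete, here is a minimal Python sketch of a MODI-style score – my own reading of the definition above, not the authors’ code:

```python
def modi(X, y):
    """MODI-style score: per-class fraction of observations whose
    nearest neighbor shares their class, averaged over the classes."""
    # squared Euclidean distance between two descriptor vectors
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    n = len(X)
    # index of each observation's nearest neighbor (excluding itself)
    nn = [min((j for j in range(n) if j != i),
              key=lambda j: dist2(X[i], X[j]))
          for i in range(n)]

    fractions = []
    for c in set(y):
        members = [i for i in range(n) if y[i] == c]
        same = sum(1 for i in members if y[nn[i]] == c)
        fractions.append(same / float(len(members)))
    return sum(fractions) / len(fractions)
```

Two well-separated clusters give a score of 1, while perfectly interleaved classes give 0.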

The authors then go on to show a pretty good correlation between the MODI scores over a number of datasets and their classification accuracies. But this leads to the question: if one has a dataset and associated modeling tools, why compute MODI? The authors state

> we suggest that MODI is a simple characteristic that can be easily computed for any dataset at the onset of any QSAR investigation

I’m not being rigorous here, but I suspect that for smaller datasets the time required for a MODI calculation is pretty similar to that for building the models themselves, and for very large datasets the MODI calculation may take longer, due to the requirement of a distance matrix calculation (though this could be alleviated using approximate nearest neighbor methods or locality-sensitive hashing). In other words – just build the model!

Another issue is the relation between MODI and SVM classification accuracy. The key feature of SVMs is that they apply the kernel trick to implicitly transform the input dataset into a higher-dimensional space that (hopefully) allows for better separability. As a result, MODI calculated on the input dataset need not be related to the transformed dataset that the SVM actually operates on. In other words, a dataset with a poor MODI could be well modeled by an SVM using an appropriate kernel.
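A toy illustration of this point: XOR-like data has a nearest-neighbor class agreement of 0 in the raw representation, but an explicit feature map mimicking a polynomial kernel’s implicit lift makes it perfectly separable (the data, the map and the weight on the product term are all my own illustrative choices):

```python
def nn_same_class_fraction(X, y):
    # fraction of points whose nearest neighbor (excluding itself)
    # shares their class -- a MODI-like quantity
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    n = len(X)
    same = 0
    for i in range(n):
        j = min((k for k in range(n) if k != i),
                key=lambda k: dist2(X[i], X[k]))
        same += 1 if y[i] == y[j] else 0
    return same / float(n)

# XOR-like data (class = sign of x1 * x2): every point's nearest
# neighbor is in the other class
X = [(-1, -1), (1, 1), (-1, 1), (1, -1)]
y = [0, 0, 1, 1]

# explicit lift adding the product term, as a polynomial kernel
# would do implicitly; the weight 3 is arbitrary, just large enough
# for the new coordinate to dominate the distances
X_lifted = [(a, b, 3 * a * b) for a, b in X]
```

Here `nn_same_class_fraction(X, y)` is 0.0 in the raw space but 1.0 after the lift.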

The paper, by definition, doesn’t say anything about which model would be best for a given dataset. Furthermore, it’s important to realize that every dataset can be perfectly predicted using a sufficiently complex model – that is, an overfit one. The MODI approach to modelability avoids this by considering a cross-validated accuracy measure.

One application of MODI that does come to mind is feature selection – identifying a descriptor subset that leads to a predictive model. This is justified by the observed correlation between MODI scores and classification accuracies, and would avoid having to test feature subsets with the modeling algorithm itself. An alternative application (as pointed out by the authors) is to identify subsets of the data that exhibit a good MODI score, leading to local QSAR models.

More generally, it would be interesting to extend the concept to regression models. Intuitively, a dataset that is continuous in a given representation should have better modelability than one that is discontinuous. This is exactly the scenario that can be captured using the activity landscape approach. Some time back I looked at characterizing the roughness of an activity landscape using SALI and applied it to the feature selection problem. Being able to correlate such a measure with the predictive accuracy of models built on those datasets would allow one to address modelability in general – and, more specifically, what level of continuity a landscape must exhibit to be modelable.

## fingerprint 3.5.2 released

Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.

His patch led to a 10-fold improvement in running time. However, his code used nested *for* loops in R. This is a well-known bottleneck, and most idiomatic R code replaces *for* loops with a member of the sapply/lapply/tapply family. In this case, however, it was easier to write a small piece of C code to perform the loops, resulting in a further 4- to 6-fold improvement over Abhik’s observed running times (see the figure summarizing Tanimoto similarity matrix calculation times for 1024-bit fingerprints, with 256 bits randomly set to 1). As always, the latest code is available on GitHub.
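The loop-elimination idea behind the speedup can be sketched in Python with numpy (purely illustrative – the fingerprint package itself does this in R and C): the whole similarity matrix reduces to one matrix product plus broadcasting, since for binary fingerprints Tanimoto(a, b) = c / (|a| + |b| − c), where c is the number of on-bits in common.

```python
import numpy as np

def tanimoto_matrix(fps):
    # fps: an (n, nbits) array of 0/1 fingerprints
    F = np.asarray(fps, dtype=float)
    c = F.dot(F.T)                      # on-bits in common for every pair
    n_on = F.sum(axis=1)                # on-bit count per fingerprint
    denom = n_on[:, None] + n_on[None, :] - c
    return c / denom                    # Tanimoto = c / (a + b - c)
```

No explicit pairwise loop appears; the BLAS-backed matrix product plays the role of the C inner loops.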

## Exploring co-morbidities in medical case studies

A previous post described a first look at the data available on casesdatabase.com, primarily summaries of high-level metadata. In this post I start looking at the cases themselves. As I noted previously, BMC has performed some form of biomedical entity recognition on the abstracts (?) of the case studies, resulting in a set of keywords for each case study. The keywords belong to specific types such as *Condition*, *Medication* and so on. The focus of this post is the occurrence of co-morbidities – which conditions occur together, to what extent, and whether such co-occurrences differ from random. The code to extract the co-morbidity data and generate the analyses below is available in co-morbidity.py

Before doing any analyses we need to clean up the *Condition* keywords. This includes normalizing terms (replacing *‘comatose’* with *‘coma’*, converting all diabetes variants such as Type 1 and Type 2 to just diabetes), fixing spelling variants (replacing *‘foetal’* with *‘fetal’*), removing stopwords and so on. The Python code to perform this cleanup requires that we manually identify these transformations; I haven’t done this rigorously, so it’s not a totally cleansed dataset. The cleanup code looks like

```python
def cleanTerms(terms):
    # map spelling and plural variants to a canonical form; identity
    # entries (e.g. 'tuberculosis') collapse longer phrases containing
    # the key down to the key itself
    repMap = {'comatose': 'coma', 'seizures': 'seizure',
              'foetal': 'fetal', 'haematomas': 'haematoma',
              'disorders': 'disorder', 'tumour': 'tumor',
              'abnormalities': 'abnormality', 'tachycardias': 'tachycardia',
              'lymphomas': 'lymphoma', 'tuberculosis': 'tuberculosis',
              'hiv': 'hiv', 'anaemia': 'anemia',
              'carcinoma': 'carcinoma', 'metastases': 'metastasis',
              'metastatic': 'metastasis', '?': '-'}
    stopwords = ['state', 'syndrome', 'low grade', 'fever', 'type ii',
                 'mellitus', 'type 2', 'type 1', 'systemic',
                 'homogeneous', 'disease']
    l = []
    terms = [x.lower().strip() for x in terms]
    for term in terms:
        for sw in stopwords:
            term = term.replace(sw, '')
        for key in repMap.keys():
            if term.find(key) >= 0:
                term = repMap[key]
        term = term.encode('ascii', 'ignore').decode('ascii') \
                   .replace('\n', '').strip()
        l.append(term)
    # terms mapped to '-' (from '?') are dropped entirely
    l = [x for x in l if x != '-']
    return list(set(l))
```

Since each case study can be associated with multiple conditions, we generate a set of unique condition pairs for each case, and collect these for all 28K cases I downloaded previously.

```python
import pickle
from itertools import combinations

cases = pickle.load(open('cases.pickle'))
allpairs = []
for case in cases:
    ## get all conditions for this case
    conds = [x for x in case['keywords'] if x['type'] == 'Condition']
    conds = cleanTerms([x['text'] for x in conds])
    if len(conds) == 0:
        continue
    conds.sort()
    pairs = list(combinations(conds, 2))
    allpairs.extend(pairs)
```

It turns out that across the whole dataset, there are a total of **991,466 pairs of conditions** corresponding to **576,838 unique condition pairs** and **25,590 unique conditions**. Now, it’s clear that some condition pairs may be causally related (some being trivial cases such as *cough* and *infection*), whereas others are not. In addition, some condition pairs are related in a semantic, rather than causal, fashion – *carcinoma* and *cancer*, for example. In the current dataset we can’t differentiate between these classes. One possibility would be to code the conditions using ICD-10 and collapse terms using the hierarchy.

Having said that, we work with what we currently have – and it’s quite sparse. In fact, the 28K case studies represent just 0.16% of all possible co-morbidities. Within the set of just under 600K unique observed co-morbidities, the bulk occur just once; for the rest of the analysis we ignore these singleton co-morbidities (leaving us with 513,997 co-morbidity occurrences). It’s interesting to see the distribution of co-morbidity frequencies. The first figure plots the number of co-morbidities that occur at least N times – **99,369 co-morbidities occur 2 or more times** in the dataset, and so on.
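The singleton filtering and the at-least-N counts can be sketched with collections.Counter (toy data; the variable names are mine, not from co-morbidity.py):

```python
from collections import Counter

# toy stand-in for the (condition_a, condition_b) tuples built earlier
allpairs = [('coma', 'seizure'), ('coma', 'seizure'), ('anemia', 'coma')]

counts = Counter(allpairs)

# drop singleton co-morbidities (observed only once)
frequent = {pair: n for pair, n in counts.items() if n >= 2}

def n_at_least(counts, N):
    # number of distinct co-morbidities occurring at least N times
    return sum(1 for n in counts.values() if n >= N)
```

Evaluating `n_at_least` over a range of N gives exactly the curve plotted in the first figure.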

Another way to visualize the data is to plot a pairwise heatmap of conditions. For pairs of conditions that occur in the cases dataset we can calculate the probability of occurrence (i.e., number of times the pair occurs divided by the number of pairs). Furthermore, using a sampling procedure we can evaluate the number of times a given pair would be selected randomly from the pool of conditions. For the current analysis, I used 1e7 samples and evaluated the probability of a co-morbidity occurring by chance. If this probability is greater than the observed probability I label that co-morbidity as not different from random (i.e., insignificant). Ideally, I would evaluate a confidence interval or else evaluate the probability analytically (?).
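The sampling estimate of chance co-occurrence can be sketched as follows (a simplified version of the idea – the function name and sampling scheme are my own, not the post’s actual code):

```python
import random

def chance_probability(conditions, pair, nsamples=100000, seed=1):
    # estimate how often a given unordered condition pair would be
    # drawn at random from the condition pool
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(nsamples):
        sampled = tuple(sorted(rng.sample(conditions, 2)))
        if sampled == target:
            hits += 1
    return hits / float(nsamples)
```

A co-morbidity would then be flagged as insignificant when this chance probability exceeds its observed probability.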

For the figure below, I considered the 48 co-morbidities (corresponding to 25 unique conditions) that occurred 250 or more times in the dataset. I display the lower triangle of the heatmap – grey indicates no occurrences for a given co-morbidity, and white X’s identify co-morbidities that have a non-zero probability of occurrence but are not different from random. As noted above, some of these pairs are not particularly informative – for example, *tumor* and *metastasis* co-occur with a relatively high probability, but this is not too surprising.

It’s pretty easy to modify co-morbidity.py to look at other sets of co-morbidities. Ideally, however, we would precompute probabilities for *all* co-morbidities and then support interactive visualization (maybe using D3).

It’s also interesting to look at co-morbidities that include a specific condition. For example, let’s consider *tuberculosis* (and all its variants). There are **948 unique co-morbidities that include tuberculosis** as one of the conditions. While the bulk of them occur just twice, a number have relatively large frequencies of occurrence – *lymphadenopathy* co-occurs with *tuberculosis* 203 times. Rather than tabulate the co-occurring conditions, we can use the frequencies to generate a word cloud, as shown below. As with the co-morbidity heatmaps, this could easily be automated to support interactive exploration. On a related note, it’d be quite interesting to compare the frequencies discussed here with data extracted from a live EHR system.

So far this has been descriptive – given the size of the data, we should be able to try out some predictive models. Future posts will look at the possibilities of modeling the case studies dataset.