So much to do, so little time

Trying to squeeze sense out of chemical data

Predictive models – Implementation vs Specification


Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need of this for many years. A few years back I described a system to deploy R models and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a. descriptors). Depending on the licenses associated with the descriptor software, such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform, which looks quite nice but only supports R models (not necessarily a bad thing!). A very recent candidate for deploying generic predictive models (amongst other things) is the plugin mechanism of the BARD platform.

But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could be implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specifications and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
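As a minimal sketch of what decoupling specification from implementation could look like, the R snippet below keeps a model as plain data and evaluates it with a separate function. The spec layout (`type`, `descriptors`, `intercept`, `coefficients`), the function name `evaluate_spec`, and all numbers are hypothetical and invented for illustration; this is not PMML or QSAR-ML syntax.

```r
# A hypothetical model specification: just data, no code.
# (Field names and values are invented for this sketch.)
spec <- list(
  type         = "linear",
  descriptors  = c("logp", "tpsa"),
  intercept    = 0.5,
  coefficients = c(logp = 1.2, tpsa = -0.01)
)

# One possible implementation of the spec. Any other language could
# read the same fields and reproduce the same prediction.
evaluate_spec <- function(spec, x) {
  stopifnot(spec$type == "linear", all(spec$descriptors %in% names(x)))
  vals <- unlist(x[spec$descriptors])
  unname(spec$intercept + sum(spec$coefficients[spec$descriptors] * vals))
}

# Descriptor values for one (made up) molecule.
pred <- evaluate_spec(spec, list(logp = 2.0, tpsa = 50))
```

The point of the sketch is only that `spec` carries everything needed to reproduce the prediction, while `evaluate_spec` could be rewritten on any platform. Computing the descriptors themselves is left outside the spec here, which is exactly where the licensing problem mentioned earlier would bite.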

Written by Rajarshi Guha

May 1st, 2013 at 12:29 am

New version of fingerprint (3.4.9) – faster Dice similarity matrices

I’ve just pushed a new version of the fingerprint package that contains an update provided by Abhik Seal that significantly speeds up calculation of pairwise similarity matrices when using the Dice similarity method. I ran a simple comparison using different numbers of random fingerprints (1024 bits, with 512 bits set to one, randomly) and measured the time to evaluate the pairwise similarity matrix. As you can see from the figure alongside, the new code is significantly faster (with speed-ups of 450x to 500x). The code to generate the timings is below – it should probably be wrapped in a loop so that each set size is timed multiple times.

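    library(fingerprint)
    ## lists of 10, 20, ..., 300 random fingerprints (1024 bits, 512 set)
    fpls <- lapply(seq(10,300,by=10),
                   function(i) sapply(1:i,
                                      function(x) random.fingerprint(1024, 512)))
    ## elapsed time for the pairwise Dice similarity matrix of each set
    times <- sapply(fpls,
                    function(fpl) system.time(fp.sim.matrix(fpl, method='dice'))[3])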
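To illustrate why a pairwise Dice matrix can be sped up so much, here is a base R sketch that compares a pair-by-pair loop with a version built on a single matrix product, and that repeats each timing as suggested above. This is not the code in the fingerprint package; the function names (`dice_loop`, `dice_matrix`, `time_median`) and the 0/1 matrix representation are assumptions made for the example.

```r
# Base R sketch: fingerprints are rows of a 0/1 matrix (random toy data).
set.seed(1)
n <- 50; nbits <- 1024
fps <- t(replicate(n, as.integer(seq_len(nbits) %in% sample(nbits, 512))))

# Naive version: loop over every pair of fingerprints.
dice_loop <- function(m) {
  n <- nrow(m)
  s <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n))
    s[i, j] <- 2 * sum(m[i, ] & m[j, ]) / (sum(m[i, ]) + sum(m[j, ]))
  s
}

# Vectorized version: shared on-bits for all pairs come from one matrix product.
dice_matrix <- function(m) {
  common <- tcrossprod(m)          # common[i, j] = bits set in both i and j
  on <- rowSums(m)                 # number of bits set in each fingerprint
  2 * common / outer(on, on, "+")
}

# Repeat each timing several times and keep the median elapsed time.
time_median <- function(f, m, reps = 5)
  median(replicate(reps, system.time(f(m))["elapsed"]))

t_loop <- time_median(dice_loop, fps)
t_vec  <- time_median(dice_matrix, fps)
```

Both functions compute 2|A∩B| / (|A| + |B|) for every pair, so they should agree to numerical precision. The actual ratio of `t_loop` to `t_vec` will depend on the machine and on `n`.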

Written by Rajarshi Guha

October 30th, 2012 at 11:10 pm

Competitive Predictive Modeling – How Useful is it?

While at the ACS National Meeting in Philadelphia I attended a talk by David Thompson of Boehringer Ingelheim (BI), where he spoke about a recent competition BI sponsored on Kaggle – a web site that hosts data mining competitions. In this instance, BI provided a dataset that contained only object identifiers, about 1700 numerical features and a binary dependent variable. The contest was open to anybody, and whoever built the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition, and also look at David’s slides.

But I’m curious about the utility of such a competition. During the competition, all the contestants had access to were the numerical features. So the contestants had no idea of the domain from which the data came – placing the onus on pure modeling ability, with no need for domain knowledge. But in fact the dataset provided to them, as announced by David at the ACS, was the Hansen AMES mutagenicity dataset characterized using a collection of 2D descriptors (continuous topological descriptors as well as binary fingerprints).

BI included some “default” models and the winning models certainly performed better (10% for the winning model). This is not surprising, as they did not attempt to build optimized models. But then we also see that the top 5 models differed only incrementally in their log loss values. Thus any one of the top 3 or 4 models could be regarded as a winner in terms of actual predictions.
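For reference, binary log loss can be sketched in a few lines of R. The function name `log_loss`, the clipping constant `eps`, and the toy labels and probabilities are assumptions for illustration and are not taken from the competition.

```r
# Binary log loss: y is the 0/1 truth, p is the predicted probability of class 1.
# eps clips probabilities away from 0 and 1 to avoid log(0).
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y   <- c(1, 0, 1, 1, 0)                  # toy labels (made up)
p_a <- c(0.9, 0.2, 0.8, 0.7, 0.1)        # toy model A
p_b <- c(0.9, 0.2, 0.8, 0.6, 0.1)        # toy model B, one less confident call

ll_a <- log_loss(y, p_a)
ll_b <- log_loss(y, p_b)
```

In this toy case both models make the same active/inactive calls at a 0.5 threshold, yet model A gets the lower (better) log loss. That is the sense in which models that are ranked differently can still be equivalent in terms of actual predictions.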

What I’d really like to know is how well such an approach leads to better chemistry or biology. First, it’s clear that such an approach leads to the optimization of pure predictive performance and cannot provide insight into why the model makes an active or inactive call. In many scenarios this is sufficient, but more often than not, domain-specific diagnostics are invaluable. Second, how does the relative increase in model performance lead to better decision making? Granted, the crowd-sourced, gamified approach is a nice way to eke out the last bits of predictive performance on a dataset – but does it really matter that one model performs 1% better than the next best model? The fact that the winning model was 10% better than the “default” BI model is not too informative. So a specific question I have is: was there a benefit, in terms of model performance and downstream decision making, in asking the crowd for a better model, compared to what BI had developed using (implicit or explicit) chemical knowledge?

My motivation is to try and understand whether the winning model was an incremental improvement or whether it was a significant jump, not just in terms of numerical performance, but in terms of the predicted chemistry/biology. People have been making noises about how data trumps knowledge (or rather hypotheses and models) and I believe that in some cases this can be true. But I also wonder to what extent this holds for chemical data mining.

But it’s equally important to understand what such a model is to be used for. In a virtual screening scenario, one could probably ignore interpretability and go for pure predictive performance. In such cases, for increasingly large libraries, it might make sense for one to have a model that is 1% better than the state of the art. (In fact, there was a very interesting talk by Nigel Duffy of Numerate, where he spoke about a closed form, analytical expression for the hit rate in a virtual screen, which indicates that for improvements in the overall performance of a VS workflow, the best investment is to increase the accuracy of the predictive model. Indeed, his results seem to indicate that even incremental improvements in model accuracy lead to a decent boost to the hit rate.)
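To get a feel for why small accuracy gains can matter in a screen, here is a back-of-the-envelope sketch in R using plain Bayes’ rule. It is not the expression from the talk, which is not reproduced here; the function name `hit_rate` and all the numbers are assumptions chosen for illustration.

```r
# Expected hit rate (precision) of a screen, from Bayes' rule.
# prevalence:  fraction of true actives in the library
# sensitivity: fraction of actives that the model flags
# specificity: fraction of inactives that the model correctly rejects
hit_rate <- function(prevalence, sensitivity, specificity) {
  tp <- sensitivity * prevalence
  fp <- (1 - specificity) * (1 - prevalence)
  tp / (tp + fp)
}

# Toy numbers (assumed): 0.1% actives in the library, 80% sensitivity.
hr_98 <- hit_rate(0.001, 0.8, 0.98)
hr_99 <- hit_rate(0.001, 0.8, 0.99)
```

With these toy numbers the hit rate goes from about 3.9% to about 7.4% when specificity rises by a single point, because false positives from the huge pool of inactives dominate the predicted hit list.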

I want to stress that I’m not claiming that BI (or any other organization involved in this type of activity) has the absolute best models and that nobody can do better. I firmly believe that however good you are at something, there’s likely to be someone better at it (after all, there are 6 billion people in the world). But I’d also like to know how and whether incrementally better models do when put to the test of real, prospective predictions.

Written by Rajarshi Guha

August 22nd, 2012 at 9:02 pm

I’d Rather Be … Reverse Engineering

Gamification is a hot topic and companies such as Tunedit and Kaggle are successfully hosting a variety of data mining competitions. These competitions employ data from a variety of domains such as bond trading, essay scoring and so on. Recently, both platforms have hosted a QSAR challenge (though not officially denoted as such). The most recent one is the challenge hosted at Kaggle by Boehringer Ingelheim.

While it’s good to see these competitions raise the profile of “data science” (and make some money for the winners), I must admit that these are not particularly interesting to me as it really boils down to looking at numbers with no context (aka domain knowledge). For example, in the Kaggle & BI example, there are 1,776 descriptors that have been normalized but no indication of the chemistry or biology. One could ask whether a certain mechanism of action is known to play a role in the biology being tested which could suggest a certain class of descriptors over another. Alternatively, one could ask whether there are a few distinct chemotypes present thus suggesting multiple local models versus a single global model. (I suppose that the supplied descriptors may lend themselves to a clustering, but a scaffold based approach would be much more direct and chemically intuitive).

This is not to say that such competitions are useless. On the contrary, lack of domain knowledge doesn’t preclude one from applying sophisticated statistical and machine learning methods to unannotated data and obtaining impressive results. The issue of data versus domain knowledge has been discussed in several places.

In contrast to the currently hosted challenge at Kaggle, an interesting twist would be to try and reverse engineer the structures from their descriptor values. There have been some previous discussions on reverse engineering structures from descriptor data. Obviously, we’re not going to be able to verify our results, but it would be an interesting challenge.

Written by Rajarshi Guha

April 6th, 2012 at 4:16 am

Words, Sentences, Fragments & Molecules

For some time I have been thinking of the analogy between linguistics (and text mining of language data) and chemistry, specifically from the point of view of fragments (though the relationship between the two fields is actually quite long and deep, since many techniques from IR have been employed in cheminformatics). For example, atoms and bonds can be considered an “alphabet” for chemical structures. Going one level up, one can consider fragments as words, which can be joined together to form larger structures (with the linguistic analog being sentences). In a talk I gave at the ACS some time back I compared fragments with n-grams (though LINGOs are probably a more direct analog).
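The n-gram analogy can be made concrete with a small base R sketch that cuts a SMILES string into overlapping q-character substrings, which is the basic idea behind LINGOs. The function name `qgrams` and the default `q = 4` are assumptions for this example, and any SMILES normalization that the original LINGO method applies is left out.

```r
# Overlapping q-character substrings of a string (q-grams).
qgrams <- function(s, q = 4) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

grams  <- qgrams("c1ccccc1")     # benzene SMILES
counts <- table(grams)           # "word" frequencies for this molecule
```

Here `grams` holds the substrings in order of position and `counts` is the bag-of-words view of the same molecule, which is the form a text mining tool would consume.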

On these lines I have been playing with text mining and modeling tools in R, mainly via the excellent tm package. One of the techniques I have been playing around with is Latent Dirichlet Allocation. This is a generative modeling approach, allowing one to associate a document (composed of a set of words) with a “topic”. Here, a topic is a group of words that have a higher probability of being generated from that topic than another topic. The technique assumes that a document is comprised of a mixture of topics – as a result, one can assign a document to different topics with different probabilities. There have been a number of applications of LDA in bioinformatics with some applications focusing on topic models as a way to cluster objects such as genes [1, 2], whereas others have used it in the more traditional document grouping context [3].

In a text mining scenario, developing an LDA model for a set of documents is relatively straightforward (in R) – perform a series of pre-processing steps (mainly to standardize the text) such as converting everything to lower case, removing stopwords and so on. At the end of this one has a series of documents, each one being represented as a bag of words. The collection of words across all documents can be converted to a document-term matrix (documents in the rows, words in the columns) which is then used as input to the LDA routine.
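The bag-of-words and document-term matrix steps can be sketched in base R as follows. The toy documents and the name `dtm_toy` are made up for illustration; in practice the tm package does this work (and much more).

```r
# Toy "documents" (made up); each is already lower-cased and stopword-free.
docs <- c(d1 = "kinase inhibitor binding assay",
          d2 = "binding assay binding",
          d3 = "gene expression assay")

# Bag of words per document: split on whitespace.
bags <- strsplit(docs, "\\s+")

# Document-term matrix: documents in rows, words in columns, counts in cells.
vocab   <- sort(unique(unlist(bags)))
dtm_toy <- t(sapply(bags, function(b) table(factor(b, levels = vocab))))
```

Each row of `dtm_toy` is one document and each column one word from the shared vocabulary, so word order within a document is discarded and only counts remain.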

Those familiar with building predictive models with keyed fingerprints will find this quite familiar – the individual bit positions represent structural fragments, and thus are the chemical analogs of words. Based on this observation I wondered what I would get (and what it would mean) by applying a technique like LDA to a collection of structures and their fragments.

My initial thought is that the use of LDA to determine a set of topics for a collection of chemical structures is essentially a clustering of the molecules, with the terms associated with the topics being representative substructures for that “cluster”. With these topics in hand, it will be interesting to see what (or whether) properties (physical, chemical, biological) may be correlated with the clusters/topics identified. The rest of this post describes a quick first look at this, using ChEMBL as the source of structures and R for performing pre-processing and modeling.

Structures & fragments

We had previously fragmented ChEMBL (v8) in house, so obtaining the data was just a matter of running an SQL query to identify all fragments that occurred in 50 or more molecules and retrieving their structures and the molecules they were associated with. This gives us 190,252 molecules covered by 6,110 fragments. While a traditional text document-based modeling project would involve a series of pre-processing steps, the only one I need to perform in this scenario is the removal of small (and thus likely very common) fragments such as benzene – the cheminformatics equivalent of removing stopwords. (Ideally I would also remove fragments that already occur in other fragments – the cheminformatics equivalent of stemming.)

The data file I have is of the form

    fragment_id, molregno, smiles, natom

where natom is the number of atoms in the fragment. The R code to generate (relatively) clean data, ready to feed to the LDA function, looks like:

    frags <- read.table('chembl.data', header=TRUE, as.is=TRUE, comment.char='', sep=',')
    names(frags) <- c('fid', 'molid', 'smiles', 'natom')
    ## drop small, very common fragments (the "stopwords")
    frags <- subset(frags, natom >= 8)
    ## now we create the "documents"; paste(..., collapse=' ') joins the fragment SMILES
    tmp <- by(frags, frags$molid, function(x) return( c(x$molid[1], paste(x$smiles, collapse=' '))))
    tmp <- data.frame(do.call('rbind', tmp), stringsAsFactors=FALSE)
    names(tmp) <- c('title', 'text')

In the code above, we rearrange the data to create “documents” – identified by a title (the molecule identifier) with the body of the document being the space-concatenated SMILES for the fragments associated with that molecule. In other words, a molecule (document) is constructed from a set of fragments (words). With the data arranged in this form we can go ahead and reuse code from the tm and topicmodels packages.

    ## Get a document-term matrix
    library(tm)
    corpus <- Corpus(VectorSource(tmp$text))
    dtm <- DocumentTermMatrix(corpus, control = list(tolower=FALSE))

Finally, we’re ready to develop some models, starting off with 6 topics.

    library(topicmodels)
    SEED <- 1234
    lda.model <- LDA(dtm, k=6, control=list(seed=SEED))

So, what are the topics that have been identified? As I noted above, each topic is really a set of “words” that have a higher probability of being generated by that topic. In the case of this model we obtain the following top 4 fragments associated with each topic (most likely fragments are at the top of the table):

Visual inspection clearly suggests distinct differences in the topics – topic 1 appears to be characterized primarily by the lack of aromaticity, whereas topic 2 appears to be characterized by quinoline and indole type structures. This is just a rough inspection of the most likely “terms” for each topic. It’s also interesting to look at how the molecules (a.k.a., documents) are assigned to the topics. The barchart indicates the distribution of molecules amongst the 6 topics.

As with other unsupervised clustering methods, the choice of k (i.e., the number of topics) is tricky. A priori there is no reason to choose one over the other. Blei in his original paper used “perplexity” as a measure of the model’s generalizability (smaller values are better). In this case, we can vary k and evaluate the perplexity: with 6 topics the perplexity is 1122, with 12 topics it drops to 786 and with 100 topics it drops to 308 – you can see that it seems to continuously decrease as the number of topics increases (which has been observed elsewhere, though in my case, the hyperparameters are kept constant). Wallach et al. have discussed various approaches to evaluating topic models.
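For readers unfamiliar with perplexity, the sketch below computes it from its usual definition, exp(-log-likelihood / number of tokens), for a tiny invented model. The matrices `beta`, `theta` and `counts` and the function name `perplexity_from` are hypothetical, and the value an LDA package reports may differ in details such as how the document-topic weights are estimated.

```r
# Toy LDA output (values invented): 2 topics, 3 terms, 2 documents.
beta  <- rbind(c(0.7, 0.2, 0.1),      # P(term | topic 1)
               c(0.1, 0.3, 0.6))      # P(term | topic 2)
theta <- rbind(c(0.9, 0.1),           # P(topic | document 1)
               c(0.2, 0.8))           # P(topic | document 2)
counts <- rbind(c(5, 1, 0),           # term counts, document 1
                c(0, 2, 4))           # term counts, document 2

# Perplexity = exp(-log-likelihood / total number of tokens).
perplexity_from <- function(counts, theta, beta) {
  p_word <- theta %*% beta            # P(term | document), mixing over topics
  loglik <- sum(counts * log(p_word))
  exp(-loglik / sum(counts))
}

pp <- perplexity_from(counts, theta, beta)
```

A model that spreads probability uniformly over the three terms would score exactly 3 here, so anything lower means the topics capture some structure. When computed on the same data used for fitting, this number tends to keep falling as topics are added, which is why held-out documents are the fairer test.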

Numerical evaluation of these models is useful, but we’re more interested in how these assignments correlate with chemical or biological features. First, one could look at the structural homogeneity of the molecules assigned to topics. For k = 6, this is probably not useful, as the individual groups are very large. With k = 100 one obtains a much more sensible estimate of homogeneity (but this is to be expected). Another way to evaluate the topics from a chemical point of view is to look at some property or activity. Given that ChEMBL provides assay and target information for the molecules, we have many ways to perform this evaluation. As a brief example, we can consider activity distributions derived from the molecules associated with each topic. Most ChEMBL molecules have multiple activities associated with them as many are tested in multiple assays. To allow comparison of activities across assays, we converted the activities in a given assay to Z-scores. Then for each molecule, we identified the minimum activity, only considering those activities that were annotated as IC50 and as exact (i.e., not < or >). After removal of a few extreme outliers we obtain:
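The Z-score step can be sketched in base R on a toy activity table. The column names (`molid`, `assay`, `value`, `type`, `rel`), the values, and the choice to filter before scaling are all assumptions for this example; the text does not say in which order the filtering and scaling were done.

```r
# Toy activity table (values made up): one row per molecule/assay measurement.
acts <- data.frame(
  molid = c(1, 2, 3, 1, 2, 3),
  assay = c("A", "A", "A", "B", "B", "B"),
  value = c(10, 20, 30, 100, 400, 700),
  type  = "IC50",
  rel   = c("=", "=", "=", "=", "=", ">")
)

# Keep only exact IC50 values (assumed here to happen before scaling).
acts <- subset(acts, type == "IC50" & rel == "=")

# Z-score within each assay so values are comparable across assays.
acts$z <- ave(acts$value, acts$assay,
              FUN = function(v) (v - mean(v)) / sd(v))

# Minimum Z-score per molecule.
min_z <- aggregate(z ~ molid, data = acts, FUN = min)
```

The resulting `min_z` has one row per molecule and could then be merged with the topic assignments to produce per-topic distributions.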

Clearly, within each group, the Z-scores cluster tightly around 0. It appears that the groups differentiate from each other in terms of the extreme values. Indeed plotting summary statistics for each group confirms this – in fact the median Z-score has a range of 0.05 and the mean Z-score a range of 0.11 across the six groups. In other words, the bulk of the groups are quite similar.

Other possibilities

The example shown here is rather simplistic and is the equivalent of unsupervised clustering. One obvious next step is to search the parameter space of the LDA model, evaluate different approaches to estimating the posterior distribution (EM or Gibbs sampling) and so on. A number of extensions to the basic LDA technique have been proposed, one of them being a supervised form of LDA.

It’d also be useful to look at this method on a slightly smaller, labeled dataset – I’ve run some preliminary experiments on the Bursi AMES dataset but those results need a little more work. More generally, smaller datasets can be problematic as the number of unique fragments can be low. In addition, fewer observations mean that the estimates of the posterior distribution become fuzzier. One way around this is to develop a model on something like the ChEMBL dataset I used here and then apply that to smaller datasets. Obviously, this goes towards ideas of applicability – but given the size of ChEMBL, it may indeed “cover” many smaller datasets.

Is this useful?

At first sight, it’s an interesting method that identifies groupings in an unsupervised manner. Of course, one could easily run k-means or any of the hierarchical clustering methods to achieve the same result. However, the generative aspect of LDA models is what is of interest to me, but it also seems to be the part that is difficult to map to a chemical setting – unlike topics in a document, which one can (usually) understand based on the likely terms for that topic, it’s not clear what a topic is for a collection of molecules in an unsupervised setting. And then, how does one infer the meaning of a topic from fragments? While it’s true that certain fragments are associated with specific properties/activities, this is certainly not a given (unlike words, where each one does have an individual meaning). Furthermore, in an unsupervised setting like the one I’ve described here, fishing for a correlation between (some set of) properties and groupings of molecules is probably not the way to go.
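For comparison, the conventional clustering route mentioned above can be sketched in base R on a toy molecule-by-fragment matrix. The matrix, the average-linkage choice and the cut at two groups are assumptions for illustration only.

```r
# Toy molecule x fragment matrix (made up): 1 = fragment present.
m <- rbind(mol1 = c(1, 1, 1, 0, 0, 0),
           mol2 = c(1, 1, 0, 0, 0, 0),
           mol3 = c(0, 0, 0, 1, 1, 1),
           mol4 = c(0, 0, 0, 0, 1, 1))

# dist(method = "binary") is the Jaccard distance, i.e. 1 - Tanimoto.
d  <- dist(m, method = "binary")
hc <- hclust(d, method = "average")

# Cut the tree into two groups of molecules.
groups <- cutree(hc, k = 2)
```

Unlike LDA, this gives each molecule a single hard group label and no per-group fragment probabilities, which is the practical difference the generative model brings.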

Written by Rajarshi Guha

January 5th, 2012 at 4:45 am