Tree Widths and Chemical Graphs

A few days back, Aaron posted a question regarding the use of the tree width of a graph (intuitively, a measure of how tree like a graph is) in a chemical context. The paper that he pointed to was not very informative in terms of chemical applications. The discussion thread then expanded to asking about the utility of this descriptor – could it be used in a QSAR context as a descriptor of molecular structure? Or is it more suitable in a “filtering” scenario, since as Aaron pointed out “Some NP-complete problems become tractable when a graph has bounded treewidth … ” (with graph isomorphism given as an example).

I took a look at the first question – is it a useful descriptor? Yamaguchi et al, seems to indicate that this is a very degenerate descriptor (i.e., different structures give you the same value of the tree width). Luckily, someone had already done the hard work of implementing a variety of algorithms to evaluate tree widths. libtw is a Java library that provides a handy framework to experiment with tree width algorithms. I implemented a simple adapter to convert CDK molecule objects into the graph data structure used by libtw and a driver to process a SMILES file and report the tree width values as well as execution times. While libtw provides a number of tree width algorithms I just used a single one (arbitrarily). The code is available on Github and requires the CDK and libtw jar files to compile and run.

I took a random sample of 10,000 molecules from ChEMBL (also in the Github repository) and evaluated the upper bound of the tree width for each molecule. In addition, I evaluated a few well known topological descriptors for comparison purposes. The four plots summarize the results.

The calculation is certainly very fast, and, surprisingly, doesn’t seem to correlate with molecular size. Apparently, some relatively small molecules take the longest time – but even those are very fast. Unfortunately, the descriptor is indeed degenerate as shown in the top right – a given tree width value shows up for both small and large molecules (the R^2 between number of bonds and tree width is 0.03). The histogram in the lower left indicates that 60% of the molecules had the same value of tree width. In other words, the tree width does not really differentiate bewteen molecular structures (in terms of size or complexity). In contrast, if we consider the Weiner Path index, which has been used extensively in QSAR models, primarily as a measure of branching, we see that it exhibits a much closer relation with molecular size. Other topological measures focusing more specifically on structural complexity such as fragment complexity show similar correlations with molecular size (and with each other).

So in conclusion, I don’t think the tree width is a useful descriptor for modeling purposes.

Cheminformatics and Hotness (or lack thereof)

A few days back I discussed some thoughts on cheminformatics vis a vis bioinformatics, inspired by a review by Aaron Sterling. In that thread, Steven Salzberg made a comment, stating

… In my opinion, it is not a “hot” field, though, in part for some of the reasons mentioned in the post – particularly the fact that the data in the field is mostly proprietary and/or secret. So they hurt themselves by that behavior. But the other reason I don’t think it is moving that fast is that, unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological advances. In bioinformatics, the amazing progress in automated DNA sequencing has driven the science forward at a tremendous pace …

I agree with Steven and others that cheminformatics is not as “hot” as bioinformatics, based on varying metrics of hotness (groups, publications, funding, etc.). However I think the perceived lack of popularity stems from a number of reasons and that technological pushes are a minor reason. (Andrew Dalke noted some of these in a comment).

1. Lack of publicaly accessible data – this has been mentioned in various places and I believe is a key reason that held back the development of cheminformatics outside industry. This is not to say that academic groups weren’t doing cheminformatics in 70’s and 80’s, but we could’ve had a much richer ecosystem.

In this vein, it’s also important to note that just public structure data, while necessary, would likely not have been sufficient for cheminformatics developemnt. Rather, structure and biological activity data are both requred for the development of novel cheminformatics methodologies. (Of course certain aspects of cheminformatics are are focused purely on chemical structure, and as such do fine in the abensce of publically accesssible activity data).

2. Small molecules can make money directly – this is a primary driver for the previous point. A small molecule with confirmed activity against a target of interest can make somebody a lot of money. It’s not just that one molecule – analogs could be even more potent. As a result, the incentive to hold back swathes of structure and activity data is the financially sensible approach. (Whether this is actually useful is open to debate). On the other hand, sequence data is rarely commercialiable (though use of the sequence could be) and hence much easier to release.

3. Burden of knowledge – as I mentioned in my previous post, I believe that to make headway in many areas of cheminformatics requires some background in chemistry, sincce mathematical abstractions (cf graph representations) only take you so far. As Andrew noted, “Bioinformatics has an “overarching mathematical theory” because it’s based very directly on evolution, encoded in linear sequences“. As a result the theoretical underpinnings of much of bioinformatics make it more accessible to the broader community of CS and mathematics. This is not to say that new mathematical developments are not possible in cheminformatics – it’s just a much more complex topic to tackle.

4. Lack of federal funding – this is really a function of the above three points. The idea that it’s all been done in industry is something I’ve heard before at meetings. Obviously, with poor or no federal funding opportunities, fewer groups see cheminformatics as a “rewarding” field. While I still think the NIH’s cancellation of the ECCR program was pretty dumb, this is not to say that there is no federal funding for cheminformatics. Applications just have to be appropriately spun.

To address Stevens’ point regarding technology driving the science – I disagree. While large scale synthesis is possible in some ways (such as combinatorial libraries, diversity oriented synthesis etc.), just making large numbers of molecules is not really a solution. If it were, we might as well generate them virtually and work from the SMILES.

Instead, what is required is large scale activity measurements. And there have been technology developments that allow one to generate large amounts of structure-actvity data – namely, High Throughput Screening (HTS) technologies. Admittedly, the data so generated is not near the scale of sequencing – but at the same time, compared to sequencing, every HTS project usually requires some form of unique optimization of assay conditions. Added to that, we’re usually looking at a complex system and not just a nucleotide sequence and it’s easy to see why HTS assays are not going to be at the scale of next gen sequencing.

But, much of this technology was relegated to industry. It’s only in the last few years that HTS technology has been accesible outside industry and efforts such as the Molecular Libraries Initiative have made great strides in getting HTS technologies to academics and more importantly, making the results of these screens publicaly available.

As a bioinformatics bystander, while I see reports of next gen sequencing pushing out GBs and TBs of data and hence the need for new bioinformatics methods – I don’t see a whole lot of “new” bioinformatics. To me it seems that its just variations of putting together sequences faster – which seems a rather narrow area, if that’s all that is being pushed by these technological developments. (I have my asbestos underwear on, so feel free to flame)

Certainly, bioinformatics is helped by high profile projects such as the Human Genome Project and the more recent 1000 Genomes project which certainly have great gee-whiz factors.  What might be an equivalent for cheminformatics? I’m not sure – but I’d guess something on the lines of systems biology or systems chemical biology might be a possibility.

Or maybe cheminformatics just needs to become “small molecule bioinformatics”?

Cheminformatics – the New World for TCS?

A few weeks back Aaron Sterling posted a review of the Handbook of Cheminformatics Algorithms (in which I have a chapter). Aaron notes

… my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry …

The review/bridging was a pretty thorough summary of the book, but the blog post as well as the comments raised a number of interesting issues that I think are worth discussing. Aaron notes

… Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind …

While the HCA may be the first compilation of cheminformatics-related algorithms in a single place, cheminformatics actually has a pretty long lineage, starting back in the 1960’s. Examples include canonicalization (Morgan, 1965) and ring perception (Hendrickson, 1961). See here for a short history of cheminformatics. Granted these are not CS journals, but that doesn’t mean that cheminformatics is a new field. Bioinformatics also seems to have a similar lineage (see this Biostar thread) with some seminal papers from the 1960’s (Dayhoff et al, 1962). Interestingly, it seems that much of the most-cited literature (alignments etc.) in bioinformatics comes from the 90’s.

Aaron then goes onto note that “there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA“. In some ways this is correct – a number of cheminformatics topics could be considered ad-hoc, rather than grounded in rigorous mathematical proofs. But there are topics, primarily in the graph theoretical areas, that are pretty rigorous. I think Aarons choice of complexity descriptors as an example is not particularly useful – granted it is easy to understand without a background in cheminformatics, but from a practical perspective, complexity descriptors tend to have limited use, synthetic feasibility being one case. (Indeed, there is an ongoing argument about whether topological 2D descriptors are useful and much of the discussion depends on the context). All the points that Aaron notes are correct: induction on small examples, lack of a formal framework for comparison, limited explanation of the utility. Indeed, these comments can be applied to many cheminformatics research reports (cf. “my FANCY-METHOD model performed 5% better on this dataset” style papers).

But this brings me to my main point – many of the real problems addressed by cheminformatics cannot be completely (usefully) abstracted away from the underlying chemistry and biology. Yes, a proof of the lower bounds on the calculation of a molecular complexity descriptor is interesting; maybe it’d get you a paper in a TCS journal. However, it is of no use to a practising chemist in deciding what molecule to make next. The key thing is that one can certainly start with a chemical graph, but in the end it must be tied back to the actual chemical  & biological problem. There are certainly examples of this such as the evaluation of bounds on fingerprint similarity (Swamidass & Baldi, 2007). I believe that this stresses the need for real collaborations between TCS, cheminformatics and chemistry.

As another example, Aaron uses the similarity principle (Martin et al, 2002) to explain how cheminformatics measures similarity in different ways and the nature of problems tacked by cheminformatics. One anonymous commenter responds

… I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous …

In my opinion, the commenter has never worked on real chemical problems, or is of the belief that chemistry can be abstracted into some “pure” framework, divorced from reality. The fact of the matter is that, from a physical point of view, similar molecules do in many cases exhibit similar behaviors. Conversely, there are many cases where similar molecules exhibit significantly different behaviors (Maggiora, 2006). But this is reality and is what cheminformatics must address. In other words, cheminformatics in the absence of chemistry is just symbols on paper.

Aaron, as well as number of commenters, notes that one of the reasons holding back cheminformatics is public access to data and tools. For data, this was indeed the case for a long time. But over the last 10 years or so, a number of large public access databases have become available. While one can certainly argue about the variability in data quality, things are much better than before. In terms of tools, open source cheminformatics tools are also relatively recent, from around 2000 or so. But, as I noted in the comment thread, there is a plethora of open source tools that one can use for most cheminformatics computations, and in some areas are equivalent to commercial implementations.

My last point, which is conjecture on my part, is that one reason for the higher profile of bioinformatics in the CS community is that is has a relatively lower barrier to entry for a non-biologist (and I’ll note that this is likely not a core reason, but a reason nonetheless). After all, the bulk of bioinformatics revolves around strings. Sure there are topics (protein structure etc) that are more physical and I don’t want to go down the semantic road of what is and what is not bioinformatics. But my experience as a faculty member in a department with both cheminformatics and bioinformatics, seems to suggest to me that, coming from a CS or math background, it is easier to get up to speed on the latter than the former. I believe that part of this is due to the fact that while both cheminformatics and bioinformatics are grounded in common, abstract data structures (sequences, graphs etc), one very quickly runs into the nuances of chemical structure in cheminformatics. An alternative way to put it is that much of bioinformatics is based on a single data type – properties of sequences. On the other hand, cheminformatics has multiple data types (aka structural representations) and which one is best for a given task is not always apparent. (Steve Salzberg also made a comment on the higher profile of bioinformatics, which I’ll address in an upcoming post).

In summary, I think Aarons post was very useful as an attempt at bridge building between two communities. Some aspects could have been better articulated – but the fact is, CS topics have been a core part of cheminformatics for a long time and there are ample problems yet to be tackled.

CDK & logP Values

Recently, Tony Williams enquired whether there had been any comparisons of the CDK with other tools for the calculation of polar surface area (PSA) and logP. Given that PSA calculations using the fragments defined by Ertl et al are pretty straightforward, it’s not surprising that the CDK implementation matches very well with the ACD Labs implementation (based on 57,000 molecules). More interesting however is the performance of different logP methods on experimental data. (Note that Mannhold et al performed a very comprehensive comparison of logP predictors. This post just focuses on the CDK).

To that end I evaluated logP values for ~ 10,000 molecules from the (proprietary) logPstar dataset, using the CDK’s XLogP implementation, ACD Labs (v12) and ChemAxon (c5.2.1_1). As can be seen from the plots, ACD performs best and the XLogP method fairs quite poorly. In all cases, default settings were used. In addition the CDK has an implementation of ALogP, but it performed so poorly that I don’t list it here.

Given that the ACD predictions are based on a neural network model, I was interested in how well a predictive model based on CDK descriptors would perform when trained on this dataset. Since this was just a quick exploration, I didn’t put too much effort into the model building process. So I evaluated a set of CDK topological and constitutional descriptors and performed minimal feature selection to remove those descriptors with undefined values – giving a final pool of 111 descriptors.

I split the dataset into a training and prediction set (60/40 split) and then threw them into a random forest model, which performs implicit feature selection and doesn’t overfit. As the plot shows, the performance is significantly better than XLogP (training set R2 = 0.87 and prediction set R2 = 0.86). Multiple training/prediction set splits gave similar results.

While it’s not as good as the ACD model, it was obtained using about 20 minutes of effort. Certainly, moving to a neural network or SVM model coupled with an explicit feature selection approach should lead to further improvements in the performance of this model.

Caching SMARTS Queries

Andrew Dalke recently published a detailed write up on his implementation of the Pubchem fingerprints and provided a pretty thorough comparison with the CDK implementation. He pointed out a number of bugs in the CDK version; but he also noted that performance could be improved by caching parsed SMARTS queries – which are used extensively in this fingerprinter. So I wanted to see whether caching really helps.

The CDK SMARTS parser is a JJTree based implementation and my anecdotal evidence suggested that SMARTS parsing was not a real bottleneck. But of course, nothing beats measurements. So I modified the SMARTSQueryTool class to cache parsed SMARTS queries using a simple LRU cache (patch):

1
2
3
4
5
6
final int MAX_ENTRIES = 20;
Map<String, QueryAtomContainer> cache = new LinkedHashMap<String, QueryAtomContainer>(MAX_ENTRIES + 1, .75F, true) {
    public boolean removeEldestEntry(Map.Entry eldest) {
        return size() > MAX_ENTRIES;
    }
};

The map is keyed on the SMARTS string. Then, when we attempt to parse a new SMARTS query, we check if it’s in the cache and if not, parse it and place it in the cache.

Now, the thing about this caching mechanism is that after 20 SMARTS queries have been cached, the least recently used one is replaced with a new one. As a result, if we perform matching with 40 unique SMARTS (in sequence) only the last 20 get cached, for a given molecule. But when we go to do it on a new molecule, the first 20 are not in the cache and hence we shouldn’t really benefit from the caching scheme. In general, if the fingerprinter (or any caller of the parser) will perform N unique SMARTS queries for a single molecule, the cache size must be at least N, for the cache to be used for subsequent molecules.

I implemented a quick test harness, reading in 100 SMILES and then generating Pubchem fingerprints for each molecule. The fingerprint generation was repeated 5 times and the time reported for each round. The box plot shows the distribution of the timings. Now, the CDK implementation has 621 SMARTS patterns – as you can see, we only get a benefit from the caching when the cache size is 700. In fact, cache sizes below that lead to a performance hit  – I assume due to time required to query the map.

While the performance improvement is not dramatic it is close to 10% compared to no-caching at all. In actuality, the major bottleneck in the SMARTS parser is the actual graph isomorphism step (which we hope to drastically improve by using the SMSD code). Maybe then, SMARTS parsing will take a bigger fraction of the time. Also keep in mind that due to the heavyweight nature of CDK molecule objects, very large caches could be a strain on memory. But to check that out, I should use a profiler.