Archive for February, 2011
A few days back, Aaron posted a question regarding the use of the tree width of a graph (intuitively, a measure of how tree like a graph is) in a chemical context. The paper that he pointed to was not very informative in terms of chemical applications. The discussion thread then expanded to asking about the utility of this descriptor – could it be used in a QSAR context as a descriptor of molecular structure? Or is it more suitable in a “filtering” scenario, since as Aaron pointed out “Some NP-complete problems become tractable when a graph has bounded treewidth … ” (with graph isomorphism given as an example).
I took a look at the first question – is it a useful descriptor? Yamaguchi et al, seems to indicate that this is a very degenerate descriptor (i.e., different structures give you the same value of the tree width). Luckily, someone had already done the hard work of implementing a variety of algorithms to evaluate tree widths. libtw is a Java library that provides a handy framework to experiment with tree width algorithms. I implemented a simple adapter to convert CDK molecule objects into the graph data structure used by libtw and a driver to process a SMILES file and report the tree width values as well as execution times. While libtw provides a number of tree width algorithms I just used a single one (arbitrarily). The code is available on Github and requires the CDK and libtw jar files to compile and run.
I took a random sample of 10,000 molecules from ChEMBL (also in the Github repository) and evaluated the upper bound of the tree width for each molecule. In addition, I evaluated a few well known topological descriptors for comparison purposes. The four plots summarize the results.
The calculation is certainly very fast, and, surprisingly, doesn’t seem to correlate with molecular size. Apparently, some relatively small molecules take the longest time – but even those are very fast. Unfortunately, the descriptor is indeed degenerate as shown in the top right – a given tree width value shows up for both small and large molecules (the R^2 between number of bonds and tree width is 0.03). The histogram in the lower left indicates that 60% of the molecules had the same value of tree width. In other words, the tree width does not really differentiate bewteen molecular structures (in terms of size or complexity). In contrast, if we consider the Weiner Path index, which has been used extensively in QSAR models, primarily as a measure of branching, we see that it exhibits a much closer relation with molecular size. Other topological measures focusing more specifically on structural complexity such as fragment complexity show similar correlations with molecular size (and with each other).
So in conclusion, I don’t think the tree width is a useful descriptor for modeling purposes.
… In my opinion, it is not a “hot” field, though, in part for some of the reasons mentioned in the post – particularly the fact that the data in the field is mostly proprietary and/or secret. So they hurt themselves by that behavior. But the other reason I don’t think it is moving that fast is that, unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological advances. In bioinformatics, the amazing progress in automated DNA sequencing has driven the science forward at a tremendous pace …
I agree with Steven and others that cheminformatics is not as “hot” as bioinformatics, based on varying metrics of hotness (groups, publications, funding, etc.). However I think the perceived lack of popularity stems from a number of reasons and that technological pushes are a minor reason. (Andrew Dalke noted some of these in a comment).
1. Lack of publicaly accessible data – this has been mentioned in various places and I believe is a key reason that held back the development of cheminformatics outside industry. This is not to say that academic groups weren’t doing cheminformatics in 70′s and 80′s, but we could’ve had a much richer ecosystem.
In this vein, it’s also important to note that just public structure data, while necessary, would likely not have been sufficient for cheminformatics developemnt. Rather, structure and biological activity data are both requred for the development of novel cheminformatics methodologies. (Of course certain aspects of cheminformatics are are focused purely on chemical structure, and as such do fine in the abensce of publically accesssible activity data).
2. Small molecules can make money directly – this is a primary driver for the previous point. A small molecule with confirmed activity against a target of interest can make somebody a lot of money. It’s not just that one molecule – analogs could be even more potent. As a result, the incentive to hold back swathes of structure and activity data is the financially sensible approach. (Whether this is actually useful is open to debate). On the other hand, sequence data is rarely commercialiable (though use of the sequence could be) and hence much easier to release.
3. Burden of knowledge – as I mentioned in my previous post, I believe that to make headway in many areas of cheminformatics requires some background in chemistry, sincce mathematical abstractions (cf graph representations) only take you so far. As Andrew noted, “Bioinformatics has an “overarching mathematical theory” because it’s based very directly on evolution, encoded in linear sequences“. As a result the theoretical underpinnings of much of bioinformatics make it more accessible to the broader community of CS and mathematics. This is not to say that new mathematical developments are not possible in cheminformatics – it’s just a much more complex topic to tackle.
4. Lack of federal funding – this is really a function of the above three points. The idea that it’s all been done in industry is something I’ve heard before at meetings. Obviously, with poor or no federal funding opportunities, fewer groups see cheminformatics as a “rewarding” field. While I still think the NIH’s cancellation of the ECCR program was pretty dumb, this is not to say that there is no federal funding for cheminformatics. Applications just have to be appropriately spun.
To address Stevens’ point regarding technology driving the science – I disagree. While large scale synthesis is possible in some ways (such as combinatorial libraries, diversity oriented synthesis etc.), just making large numbers of molecules is not really a solution. If it were, we might as well generate them virtually and work from the SMILES.
Instead, what is required is large scale activity measurements. And there have been technology developments that allow one to generate large amounts of structure-actvity data – namely, High Throughput Screening (HTS) technologies. Admittedly, the data so generated is not near the scale of sequencing – but at the same time, compared to sequencing, every HTS project usually requires some form of unique optimization of assay conditions. Added to that, we’re usually looking at a complex system and not just a nucleotide sequence and it’s easy to see why HTS assays are not going to be at the scale of next gen sequencing.
But, much of this technology was relegated to industry. It’s only in the last few years that HTS technology has been accesible outside industry and efforts such as the Molecular Libraries Initiative have made great strides in getting HTS technologies to academics and more importantly, making the results of these screens publicaly available.
As a bioinformatics bystander, while I see reports of next gen sequencing pushing out GBs and TBs of data and hence the need for new bioinformatics methods – I don’t see a whole lot of “new” bioinformatics. To me it seems that its just variations of putting together sequences faster – which seems a rather narrow area, if that’s all that is being pushed by these technological developments. (I have my asbestos underwear on, so feel free to flame)
Certainly, bioinformatics is helped by high profile projects such as the Human Genome Project and the more recent 1000 Genomes project which certainly have great gee-whiz factors. What might be an equivalent for cheminformatics? I’m not sure – but I’d guess something on the lines of systems biology or systems chemical biology might be a possibility.
Or maybe cheminformatics just needs to become “small molecule bioinformatics”?
… my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry …
The review/bridging was a pretty thorough summary of the book, but the blog post as well as the comments raised a number of interesting issues that I think are worth discussing. Aaron notes
… Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind …
While the HCA may be the first compilation of cheminformatics-related algorithms in a single place, cheminformatics actually has a pretty long lineage, starting back in the 1960′s. Examples include canonicalization (Morgan, 1965) and ring perception (Hendrickson, 1961). See here for a short history of cheminformatics. Granted these are not CS journals, but that doesn’t mean that cheminformatics is a new field. Bioinformatics also seems to have a similar lineage (see this Biostar thread) with some seminal papers from the 1960′s (Dayhoff et al, 1962). Interestingly, it seems that much of the most-cited literature (alignments etc.) in bioinformatics comes from the 90′s.
Aaron then goes onto note that “there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA“. In some ways this is correct – a number of cheminformatics topics could be considered ad-hoc, rather than grounded in rigorous mathematical proofs. But there are topics, primarily in the graph theoretical areas, that are pretty rigorous. I think Aarons choice of complexity descriptors as an example is not particularly useful – granted it is easy to understand without a background in cheminformatics, but from a practical perspective, complexity descriptors tend to have limited use, synthetic feasibility being one case. (Indeed, there is an ongoing argument about whether topological 2D descriptors are useful and much of the discussion depends on the context). All the points that Aaron notes are correct: induction on small examples, lack of a formal framework for comparison, limited explanation of the utility. Indeed, these comments can be applied to many cheminformatics research reports (cf. “my FANCY-METHOD model performed 5% better on this dataset” style papers).
But this brings me to my main point – many of the real problems addressed by cheminformatics cannot be completely (usefully) abstracted away from the underlying chemistry and biology. Yes, a proof of the lower bounds on the calculation of a molecular complexity descriptor is interesting; maybe it’d get you a paper in a TCS journal. However, it is of no use to a practising chemist in deciding what molecule to make next. The key thing is that one can certainly start with a chemical graph, but in the end it must be tied back to the actual chemical & biological problem. There are certainly examples of this such as the evaluation of bounds on fingerprint similarity (Swamidass & Baldi, 2007). I believe that this stresses the need for real collaborations between TCS, cheminformatics and chemistry.
As another example, Aaron uses the similarity principle (Martin et al, 2002) to explain how cheminformatics measures similarity in different ways and the nature of problems tacked by cheminformatics. One anonymous commenter responds
… I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous …
In my opinion, the commenter has never worked on real chemical problems, or is of the belief that chemistry can be abstracted into some “pure” framework, divorced from reality. The fact of the matter is that, from a physical point of view, similar molecules do in many cases exhibit similar behaviors. Conversely, there are many cases where similar molecules exhibit significantly different behaviors (Maggiora, 2006). But this is reality and is what cheminformatics must address. In other words, cheminformatics in the absence of chemistry is just symbols on paper.
Aaron, as well as number of commenters, notes that one of the reasons holding back cheminformatics is public access to data and tools. For data, this was indeed the case for a long time. But over the last 10 years or so, a number of large public access databases have become available. While one can certainly argue about the variability in data quality, things are much better than before. In terms of tools, open source cheminformatics tools are also relatively recent, from around 2000 or so. But, as I noted in the comment thread, there is a plethora of open source tools that one can use for most cheminformatics computations, and in some areas are equivalent to commercial implementations.
My last point, which is conjecture on my part, is that one reason for the higher profile of bioinformatics in the CS community is that is has a relatively lower barrier to entry for a non-biologist (and I’ll note that this is likely not a core reason, but a reason nonetheless). After all, the bulk of bioinformatics revolves around strings. Sure there are topics (protein structure etc) that are more physical and I don’t want to go down the semantic road of what is and what is not bioinformatics. But my experience as a faculty member in a department with both cheminformatics and bioinformatics, seems to suggest to me that, coming from a CS or math background, it is easier to get up to speed on the latter than the former. I believe that part of this is due to the fact that while both cheminformatics and bioinformatics are grounded in common, abstract data structures (sequences, graphs etc), one very quickly runs into the nuances of chemical structure in cheminformatics. An alternative way to put it is that much of bioinformatics is based on a single data type – properties of sequences. On the other hand, cheminformatics has multiple data types (aka structural representations) and which one is best for a given task is not always apparent. (Steve Salzberg also made a comment on the higher profile of bioinformatics, which I’ll address in an upcoming post).
In summary, I think Aarons post was very useful as an attempt at bridge building between two communities. Some aspects could have been better articulated – but the fact is, CS topics have been a core part of cheminformatics for a long time and there are ample problems yet to be tackled.
Recently, Tony Williams enquired whether there had been any comparisons of the CDK with other tools for the calculation of polar surface area (PSA) and logP. Given that PSA calculations using the fragments defined by Ertl et al are pretty straightforward, it’s not surprising that the CDK implementation matches very well with the ACD Labs implementation (based on 57,000 molecules). More interesting however is the performance of different logP methods on experimental data. (Note that Mannhold et al performed a very comprehensive comparison of logP predictors. This post just focuses on the CDK).
To that end I evaluated logP values for ~ 10,000 molecules from the (proprietary) logPstar dataset, using the CDK’s XLogP implementation, ACD Labs (v12) and ChemAxon (c5.2.1_1). As can be seen from the plots, ACD performs best and the XLogP method fairs quite poorly. In all cases, default settings were used. In addition the CDK has an implementation of ALogP, but it performed so poorly that I don’t list it here.
Given that the ACD predictions are based on a neural network model, I was interested in how well a predictive model based on CDK descriptors would perform when trained on this dataset. Since this was just a quick exploration, I didn’t put too much effort into the model building process. So I evaluated a set of CDK topological and constitutional descriptors and performed minimal feature selection to remove those descriptors with undefined values – giving a final pool of 111 descriptors.
I split the dataset into a training and prediction set (60/40 split) and then threw them into a random forest model, which performs implicit feature selection and doesn’t overfit. As the plot shows, the performance is significantly better than XLogP (training set R2 = 0.87 and prediction set R2 = 0.86). Multiple training/prediction set splits gave similar results.
While it’s not as good as the ACD model, it was obtained using about 20 minutes of effort. Certainly, moving to a neural network or SVM model coupled with an explicit feature selection approach should lead to further improvements in the performance of this model.