Archive for the ‘Literature’ Category
Edit 10/9/14 – Updated statistics for the 1024 bit fingerprints
There’s been some discussion about a paper by O’Hagan et al, who have proposed a Rule of 0.5 stating that 90% of approved drugs exhibit a Tanimoto similarity > 0.5 to one or more human metabolites. Their analysis is based on metabolites listed in Recon2, a reconstruction of the human metabolic network. The idea makes sense and there’s an in-depth discussion at In the Pipeline.
Given the authors’ claim that
a successful drug is likely to lie within a Tanimoto distance of 0.5 of a known human metabolite. While this does not mean, of course, that a molecule obeying the rule is likely to become a marketed drug for humans, it does mean that a molecule that fails to obey the rule is statistically most unlikely to do so
I was interested in seeing how this rule of thumb holds up when faced with compounds that are not supposed to make it through the drug development pipeline. Since PAINS appear to be the structural filter du jour, I decided to look at compounds that failed the PAINS filter. I worked with the 10,000 compounds included in Saubern et al. Simon Saubern provided me with the set of 861 compounds that failed the PAINS filters, allowing me to extract the set of compounds that passed (9139).
Chris Swain was kind enough to extract the compound entries from the Matlab dump provided by O’Hagan et al. This file contained InChI representations for a subset of the entries. I extracted the 2980 valid InChI strings and converted them to SMILES using ChemAxon molconvert 6.0.5. The processed data (metabolite name, InChI and SMILES) are available here. However, after deduplication, there were 1335 unique metabolites.
Now, O’Hagan et al, for some reason, used the 166 bit MACCS keys, but hashed them to 1024 bits. Usually, when using a keyed fingerprint, the goal is to retain the correspondence between bit position and substructure. The hashing step results in a loss of such correspondence. So it’s a bit surprising that they didn’t use some sort of path (Daylight) or environment (ECFPn) based fingerprint. Since I didn’t know how they hashed the MACCS keys, I calculated 166 bit MACCS keys and 1024 bit ECFP6 and extended path fingerprints using the CDK (via rcdk). Then for each compound in the PAINS pass or fail set, I computed the similarity to each of the 1335 metabolites and identified the maximum similarity (termed NMTS in the paper) and then plotted the distribution of these NMTS values between the PAINS pass and fail sets.
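The NMTS calculation described above can be sketched in a few lines. This is a minimal illustration in Python rather than the rcdk code actually used; the function names and the toy bit positions are assumptions made for the example, and fingerprints are represented simply as sets of "on" bit positions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def nmts(compound_fp, metabolite_fps):
    """Nearest-metabolite Tanimoto similarity: the maximum over all metabolites."""
    return max(tanimoto(compound_fp, m) for m in metabolite_fps)


# Toy fingerprints (hypothetical bit positions, not real MACCS or ECFP output)
metabolites = [{1, 2, 3, 4}, {2, 5, 8}, {10, 11}]
compound = {1, 2, 3, 9}

# The closest metabolite shares 3 bits out of a union of 5
print(nmts(compound, metabolites))
```

In the real analysis this would be repeated for every compound in the PAINS pass and fail sets against all 1335 metabolites, once per fingerprint type.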
First, the similarity cutoff proposed by the authors is obviously dependent on the fingerprint. So while the bulk of the 166 bit MACCS similarities are > 0.5, this is not really meaningful. A more relevant comparison is to 1024 bit fingerprints – both are hashed, so should be somewhat comparable to the authors’ choice of hashed MACCS keys.
The path fingerprints lead to an NMTS of ~ 0.25 for both PAINS pass and fail sets and the ECFP6 leads to an NMTS of ~ 0.18 for both sets. Though the difference in medians between the pass and fail sets for the path fingerprint is statistically significant (p = 1.498e-05, Wilcoxon test), the difference itself is very small: 0.005. (For the circular fingerprint there is no statistically significant difference). However, the PAINS pass set does contain more outliers with values > 0.5. In that sense the proposed rule does separate the two groups. Off the top of my head I don’t know whether the WEHI screening deck that was the source of the 10,000 compounds was designed to be drug-like. At the same time all this might be saying is there is no relationship between metabolite-likeness and PAINS-likeness.
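The two quantities discussed above, the difference in medians and the share of compounds above the 0.5 cutoff, are easy to compute side by side. The sketch below is only an illustration: the function name `summarize` and the NMTS values are made up for the example (they are not the values from this analysis), and it does not perform the Wilcoxon test itself.

```python
from statistics import median


def summarize(pass_nmts, fail_nmts, cutoff=0.5):
    """Compare two sets of NMTS values by median and by fraction above a cutoff."""
    return {
        "median_diff": median(pass_nmts) - median(fail_nmts),
        "pass_frac_above": sum(v > cutoff for v in pass_nmts) / len(pass_nmts),
        "fail_frac_above": sum(v > cutoff for v in fail_nmts) / len(fail_nmts),
    }


# Made-up NMTS values, for illustration only
pass_set = [0.20, 0.25, 0.26, 0.30, 0.62]
fail_set = [0.19, 0.24, 0.25, 0.27, 0.31]
print(summarize(pass_set, fail_set))
```

A tiny median difference can still be statistically significant with thousands of compounds, which is why the effect size and the tail behaviour are worth reporting alongside the p-value.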
It’d be interesting to see how this type of analysis holds up with other well known filter rules (REOS, Lilly etc). A related thing to look at would be to see how druglikeness scores compare with NMTS values.
Code and data are available in this repository
UPDATE (3/21) – I was contacted by the author of the paper who pointed out that my analysis was based on a misunderstanding of the paper. Specifically:
- The primary goal of WES is to identify actives – and according to the author’s definition, the most interesting actives (that should be ranked highly) are those that have no dose response and show a constant activity equal to the positive control. Next in importance are compounds that exhibit a dose response. Finally the least interesting (and so lowest ranked) are those that show no dose response and are flat at the negative control level.
- The WES method requires that data be normalized such that DMSO (i.e., negative control) is at 0 and positive control is at 100%.
Since my analysis was based on the wrong normalization scheme the conclusions were erroneous. When the proper normalization is taken into account, the method works as advertised in that it correctly ranks compounds that show constant activity at the positive control level at the top, followed by curves with a dose response and finally with inactives (no activity at all) at the bottom.
Based on this I’ve updated the figures and text to correct my mistake. However, in my opinion, if the goal is to identify compounds that have a constant activity one does not need to go to entropy. In addition, for the case of compounds with a well defined dose response, the WES essentially ranks them by potency (assuming a valid curve fit). The updated text goes on to discuss these aspects.
UPDATE (2/25) – Regenerated the enrichment curves so that data was ranked in the correct order when LAC50 was being used.
I came across a paper that describes the use of weighted entropy to rank order dose response curves. As this data type is the bread and butter of my day job, a simple ranking method is always of interest to me. While the method works as advertised, it appears to be a rather constrained method and doesn’t seem to do a whole lot better than simpler, pre-existing approaches.
The paper correctly notes that there is no definitive protocol to rank compounds using their dose response curves. Such rankings are invariably problem dependent – in some cases, simple potency based ranking of good quality curves is sufficient. In other cases structural clustering combined with a measure of potency enrichment is more suitable. In addition, it is also true that all compounds in a screen do not necessarily fit well to a 4-parameter Hill model. This may simply be due to noise but could also be due to some process that is better fit by some other model (bell or U shaped curves). The point being that rankings based on a pre-defined model may not be useful or accurate.
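For reference, the 4-parameter Hill model mentioned above can be written as a small function. This is a sketch of the standard form of the equation; the parameter names are my own choice and the curve-fitting step is not shown.

```python
def hill(conc, bottom, top, ac50, slope):
    """4-parameter Hill model.

    conc and ac50 must be positive and in the same units; bottom and top are
    the lower and upper asymptotes of the response; slope is the Hill slope.
    """
    return bottom + (top - bottom) / (1.0 + (ac50 / conc) ** slope)


# At conc == ac50 the response is halfway between the two asymptotes
print(hill(1e-6, bottom=0.0, top=100.0, ac50=1e-6, slope=1.0))

# Responses across a dilution series for a hypothetical compound
for c in (1e-9, 1e-8, 1e-7, 1e-6, 1e-5):
    print(c, round(hill(c, 0.0, 100.0, 1e-7, 1.2), 1))
```

Bell or U shaped curves cannot be captured by this monotonic form, which is the motivation for model-free rankings.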
The paper proposes the use of entropy as a way to rank dose response curves in a model-free manner. While a natural approach is to use Shannon entropy, the author suggests that the equal weighting implicit in the calculation is unsuitable. Instead, the use of weighted entropy (WES) is proposed as a more robust approach that takes into account unreliable data points. The author defines the weights based on the level of detection of the assay (though I’d argue that since the intended goal is to capture the reliability of individual response points, a more appropriate weight should be derived from some form of variance – either from replicate data or else pooled across the collection). The author then suggests that curves should be ranked by the WES value, with higher values indicating a better rank.
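To make the idea concrete, here is a minimal sketch of a generic weighted Shannon entropy computed over the responses of a single curve. It is an assumption-laden illustration, not the formulation from the paper: the responses are normalized to proportions by their absolute sum, and the weights are simply supplied by the caller rather than derived from the assay’s level of detection.

```python
import math


def weighted_entropy(responses, weights=None):
    """Generic weighted Shannon entropy of a response vector.

    responses are converted to proportions p_i = |r_i| / sum(|r_j|).
    weights default to 1.0, which gives the plain Shannon entropy.
    NOTE: the weighting scheme is an assumption, not the paper's definition.
    """
    total = sum(abs(r) for r in responses)
    if total == 0:
        return 0.0  # an all-zero curve carries no signal
    if weights is None:
        weights = [1.0] * len(responses)
    h = 0.0
    for r, w in zip(responses, weights):
        p = abs(r) / total
        if p > 0:  # 0 * log(0) is taken to be 0
            h -= w * p * math.log(p)
    return h


flat = [100.0, 100.0, 100.0, 100.0]   # constant activity
spike = [0.0, 0.0, 0.0, 100.0]        # response at top concentration only
print(weighted_entropy(flat), weighted_entropy(spike))
```

With unit weights a flat curve gives the maximum value (the log of the number of points) regardless of its level, so it is the weights that must carry the distinction between a curve flat at the positive control and one flat near the noise floor.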
For any proposed ranking scheme, one must first define what the goal is. When ranking dose response curves, are we looking for compounds
- that exhibit a well defined dose response (top and bottom asymptotes, > 80% efficacy etc)?
- that have good potency, even if the curve is not that well fit?
- that have a specific chemotype?
According to the paper, a key goal is to be able to identify compounds that show a constant activity – and within such compounds the more interesting ones are those that have constant activity = 100%. While I disagree that these are the most interesting compounds, it is not clear why one would need an entropy based method to identify such constant-activity curves (either at 100% or 0%).
More generally, for well defined dose response curves, the WES, by definition, tracks potency. This can be seen in the figure alongside that plots the WES value vs the log AC50 for a set of 27 good quality curves taken from a screen of 1408 AR agonists. Granted, when no model can be fit, one does not have an AC50, whereas a WES can be evaluated. But in such a case it’s not clear why one would necessarily want to quantify presumably noisy data.
However, going along with the author’s definition, the method does distinguish valid dose responses from inactives (though again, one does not require entropy to make such a distinction!) as shown in the adjoining figure. It is clear from the definition of WES that a curve that is flat at 100% will exhibit the maximum value of WES and so will always rank high.
One way to test the performance of ranking methods is to take a collection of curves, rank them by a measure and identify how many actives are identified in the top N% of the collection, for varying N. Ideally, a good ranking would identify nearly all the actives for a small N. If the ranking were random one would identify N% of the actives in the top N% of the collection. Here an active is defined in terms of curve class, a heuristic that we use to initially weed out poor quality curves and focus on good quality ones. I defined active as curve classes 1.1, 1.2, 2.1 and 2.2 (see here for a summary of curve classes).
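The enrichment calculation described above can be sketched as follows. The function name, the `higher_is_better` flag and the toy data are assumptions made for this example; the flag exists because some measures (such as log AC50) rank better when lower, while others (such as AUC or WES) rank better when higher.

```python
def enrichment_curve(scores, is_active, fractions, higher_is_better=True):
    """Fraction of all actives recovered in the top f of a ranked collection.

    scores: one ranking value per curve
    is_active: one boolean per curve (e.g. derived from curve class)
    fractions: values of N / 100 at which to evaluate the curve
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    n_active = sum(is_active)
    curve = []
    for f in fractions:
        top_n = int(round(f * len(scores)))
        found = sum(is_active[i] for i in order[:top_n])
        curve.append(found / n_active if n_active else 0.0)
    return curve


# Toy collection of 10 curves with 3 actives (hypothetical values)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
actives = [True, True, False, True, False, False, False, False, False, False]
print(enrichment_curve(scores, actives, [0.2, 0.5, 1.0]))
```

A random ranking would, on average, recover a fraction f of the actives in the top f of the collection, so a useful measure should sit well above that diagonal.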
As pointed out by the author during our conversation, this is not an entirely fair comparison – since my scheme does not consider a flat curve at 100% as active. Though it’s a valid point, the dataset I worked with did not have any such curves. More generally, such curves would be the exception in a qHTS screen (assuming the concentration ranges have been correctly chosen). From that point of view, one should be able to apply WES to generate a ranking for any qHTS screen; otherwise one would have to inspect the curves first to check whether the screen contains such “flat actives” and only then apply WES, which is not the right way to go about it.
As the enrichment plot alongside shows (generated for the 1408 compound AR agonist dataset), WES works better than random (and much better than the standard Shannon entropy), but is still outperformed by the area under the dose response curve (AUC) and potency. I certainly don’t claim that AUC is a completely robust way to rank dose response curves (in fact for some cases such as invalid curve fits, it’d be nonsensical). I also include LAC50, the logarithm of the AC50, as a ranking method simply because the paper considers it a poor way to rank curves (which I agree with, particularly if one does not first filter for good quality, efficacious curves).
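For completeness, a model-free AUC of the kind used as a ranking measure here can be computed with the trapezoidal rule. This is a sketch under stated assumptions: it integrates the observed responses over log10 concentration, which is one common choice and not necessarily the exact calculation behind the plot, and the function name is my own.

```python
import math


def curve_auc(concs, responses):
    """Trapezoidal area under a dose response curve on a log10 concentration axis."""
    points = sorted((math.log10(c), r) for c, r in zip(concs, responses))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += 0.5 * (y0 + y1) * (x1 - x0)
    return auc


# Hypothetical three-point curve rising from 0 to 100% over two log units
print(curve_auc([1e-9, 1e-8, 1e-7], [0.0, 50.0, 100.0]))
```

Because it uses the raw data points, this value is defined even when no Hill fit is available, but it inherits any noise in the responses.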
There are a few other issues, though I think the most egregious one was that the method was tested on just one dataset. I’m not convinced that a single dataset represents a sufficient validation (given that Tox21 has about 80 published bioassays in PubChem). But that’s a case of poor reviewing rather than a technical flaw.
Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from YouTube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as better theoretical understanding.
It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can be then used as input to any classifier. A key aspect to such learned representations is that they are, in general, agnostic with respect to the final task for which they are trained. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al have published an extensive and excellent review of this topic and Baldi also has a nice review on deep learning.
In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, but neither their code nor approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al where they develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.
A key motivation for the new graph representation and deep learning approach was the observation
one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction
A related motivation was that they desired to apply deep learning methods directly to the molecular graph, which in general, is of variable size compared to fixed length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:
- The motivation for the DAG based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that all the physical aspects of aqueous solubility may not be captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations may be too time consuming (QM descriptors) for a cheminformatics setting.
- The results are not impressive compared to pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNNs. For example, in Table 2 the best RNN model has an R2 of 0.92 versus 0.91 for the pre-existing model (a 2D kernel). But R2 is usually not a good metric for non-linear regression, and even the RMSE is only 0.03 units better than the pre-existing model. However, it is certainly true that the unsupervised nature of the representation learning step is quite attractive – this is evident in the case of the intrinsic solubility dataset, where they achieve similar results to the prior model. But the prior model employed a manually selected set of topological descriptors.
- It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.
One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.
Overall, I don’t think the actual predictions are too impressive using this approach. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could be like, analogous to what a deep network thinks a cat looks like.
I came across an ASAP paper today describing substructure searching in Oracle databases. The paper comes from the folks at J & J and is part of their series of papers on the ABCD platform. Performing substructure searches in databases is certainly not a new topic and various products are out there that support this in Oracle (as well as other RDBMSs). The paper describes how the ABCD system does this using a combination of structure-derived hash keys and an inverted bitset based index, and discusses their implementation as an Oracle cartridge. They provide an interesting discussion of how their implementation supports Cost Based Optimization of SQL queries involving substructure search. The authors run a number of benchmarks. In terms of comparative benchmarks they compare the performance (i.e., screening efficiency) of their hashed keys versus MACCS keys, CACTVS keys and OpenBabel FP2 fingerprints. Their results indicate that the screening step is a key bottleneck in the query process and that their hash key is generally more selective than the others.
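The general idea of screening with an inverted index can be sketched briefly. This is not the ABCD implementation, which is not publicly available; the function names, the use of Python sets in place of bitsets, and the toy keys are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each key (bit) to the set of molecule ids whose fingerprint sets it."""
    index = defaultdict(set)
    for mol_id, bits in fingerprints.items():
        for bit in bits:
            index[bit].add(mol_id)
    return index


def screen(query_bits, index, all_ids):
    """Return ids of molecules whose fingerprints contain every query key.

    Survivors are only candidates: each one still has to be confirmed by a
    full subgraph isomorphism check against the query structure.
    """
    candidates = set(all_ids)
    for bit in query_bits:
        candidates &= index.get(bit, set())
    return candidates


# Hypothetical database of three molecules with toy key sets
db = {"mol1": {1, 2, 3}, "mol2": {2, 3}, "mol3": {1, 3, 7}}
idx = build_inverted_index(db)
print(sorted(screen({1, 3}, idx, db.keys())))
```

The more selective the keys, the fewer candidates reach the expensive atom-by-atom matching step, which is why screening efficiency is the quantity being benchmarked.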
Unfortunately, what would have been interesting but was not provided was a comparison of the performance at the Oracle query level with other products such as JChem Cartridge and OrChem. Furthermore, the test case is just under a million molecules from Golovin & Henrick – the entire dataset (not just the keys) could probably reside in-memory on today’s servers. How does the system perform when faced with, say, PubChem (34 million molecules)? The paper mentions a command line implementation of their search procedure, but as far as I can tell, the Oracle cartridge is not available.
The ABCD system has many useful and interesting features. But as with the other publications on this system, this paper is one more in the line of “Papers About Systems You Can’t Use or Buy”. Unfortunate.
… my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry …
The review/bridging was a pretty thorough summary of the book, but the blog post as well as the comments raised a number of interesting issues that I think are worth discussing. Aaron notes
… Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind …
While the HCA may be the first compilation of cheminformatics-related algorithms in a single place, cheminformatics actually has a pretty long lineage, starting back in the 1960s. Examples include canonicalization (Morgan, 1965) and ring perception (Hendrickson, 1961). See here for a short history of cheminformatics. Granted these are not CS journals, but that doesn’t mean that cheminformatics is a new field. Bioinformatics also seems to have a similar lineage (see this Biostar thread) with some seminal papers from the 1960s (Dayhoff et al, 1962). Interestingly, it seems that much of the most-cited literature (alignments etc.) in bioinformatics comes from the 90s.
Aaron then goes on to note that “there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA”. In some ways this is correct – a number of cheminformatics topics could be considered ad-hoc, rather than grounded in rigorous mathematical proofs. But there are topics, primarily in the graph theoretical areas, that are pretty rigorous. I think Aaron’s choice of complexity descriptors as an example is not particularly useful – granted it is easy to understand without a background in cheminformatics, but from a practical perspective, complexity descriptors tend to have limited use, synthetic feasibility being one case. (Indeed, there is an ongoing argument about whether topological 2D descriptors are useful and much of the discussion depends on the context). All the points that Aaron notes are correct: induction on small examples, lack of a formal framework for comparison, limited explanation of the utility. Indeed, these comments can be applied to many cheminformatics research reports (cf. “my FANCY-METHOD model performed 5% better on this dataset” style papers).
But this brings me to my main point – many of the real problems addressed by cheminformatics cannot be completely (usefully) abstracted away from the underlying chemistry and biology. Yes, a proof of the lower bounds on the calculation of a molecular complexity descriptor is interesting; maybe it’d get you a paper in a TCS journal. However, it is of no use to a practising chemist in deciding what molecule to make next. The key thing is that one can certainly start with a chemical graph, but in the end it must be tied back to the actual chemical & biological problem. There are certainly examples of this such as the evaluation of bounds on fingerprint similarity (Swamidass & Baldi, 2007). I believe that this stresses the need for real collaborations between TCS, cheminformatics and chemistry.
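As a small illustration of the kind of bound mentioned above: the Tanimoto similarity of two bit fingerprints can never exceed the ratio of the smaller bit count to the larger one, since the intersection is at most the smaller count and the union is at least the larger. The sketch below uses that fact to skip comparisons that cannot reach a threshold; the function names are my own and this is only the simplest such bound, not a reproduction of the cited paper.

```python
def tanimoto_upper_bound(n_bits_a, n_bits_b):
    """Upper bound on Tanimoto similarity given only the number of set bits."""
    if n_bits_a == 0 or n_bits_b == 0:
        return 0.0
    return min(n_bits_a, n_bits_b) / max(n_bits_a, n_bits_b)


def can_skip(query_count, candidate_count, threshold):
    """True if the candidate cannot possibly reach the similarity threshold."""
    return tanimoto_upper_bound(query_count, candidate_count) < threshold


# A query with 10 bits set can never be >= 0.5 similar to one with 40 bits set
print(can_skip(10, 40, 0.5), can_skip(10, 15, 0.5))
```

The result is purely combinatorial, but it pays off directly in a practical task (pruning a similarity search), which is the sort of link between theory and application argued for above.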
As another example, Aaron uses the similarity principle (Martin et al, 2002) to explain how cheminformatics measures similarity in different ways and the nature of problems tackled by cheminformatics. One anonymous commenter responds
… I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous …
In my opinion, the commenter has never worked on real chemical problems, or is of the belief that chemistry can be abstracted into some “pure” framework, divorced from reality. The fact of the matter is that, from a physical point of view, similar molecules do in many cases exhibit similar behaviors. Conversely, there are many cases where similar molecules exhibit significantly different behaviors (Maggiora, 2006). But this is reality and is what cheminformatics must address. In other words, cheminformatics in the absence of chemistry is just symbols on paper.
Aaron, as well as a number of commenters, notes that one of the factors holding back cheminformatics is public access to data and tools. For data, this was indeed the case for a long time. But over the last 10 years or so, a number of large public access databases have become available. While one can certainly argue about the variability in data quality, things are much better than before. In terms of tools, open source cheminformatics tools are also relatively recent, from around 2000 or so. But, as I noted in the comment thread, there is a plethora of open source tools that one can use for most cheminformatics computations, and in some areas they are equivalent to commercial implementations.
My last point, which is conjecture on my part, is that one reason for the higher profile of bioinformatics in the CS community is that it has a relatively lower barrier to entry for a non-biologist (and I’ll note that this is likely not a core reason, but a reason nonetheless). After all, the bulk of bioinformatics revolves around strings. Sure there are topics (protein structure etc) that are more physical and I don’t want to go down the semantic road of what is and what is not bioinformatics. But my experience as a faculty member in a department with both cheminformatics and bioinformatics suggests that, coming from a CS or math background, it is easier to get up to speed on the latter than the former. I believe that part of this is due to the fact that while both cheminformatics and bioinformatics are grounded in common, abstract data structures (sequences, graphs etc), one very quickly runs into the nuances of chemical structure in cheminformatics. An alternative way to put it is that much of bioinformatics is based on a single data type – properties of sequences. On the other hand, cheminformatics has multiple data types (aka structural representations) and which one is best for a given task is not always apparent. (Steve Salzberg also made a comment on the higher profile of bioinformatics, which I’ll address in an upcoming post).
In summary, I think Aaron’s post was very useful as an attempt at bridge building between two communities. Some aspects could have been better articulated – but the fact is, CS topics have been a core part of cheminformatics for a long time and there are ample problems yet to be tackled.