Archive for the ‘qsar’ tag
A few days back, Aaron posted a question regarding the use of the tree width of a graph (intuitively, a measure of how tree-like a graph is) in a chemical context. The paper that he pointed to was not very informative in terms of chemical applications. The discussion thread then expanded to asking about the utility of this descriptor – could it be used in a QSAR context as a descriptor of molecular structure? Or is it more suitable in a “filtering” scenario, since, as Aaron pointed out, “Some NP-complete problems become tractable when a graph has bounded treewidth … ” (with graph isomorphism given as an example)?
I took a look at the first question – is it a useful descriptor? Yamaguchi et al. seem to indicate that this is a very degenerate descriptor (i.e., different structures give you the same value of the tree width). Luckily, someone had already done the hard work of implementing a variety of algorithms to evaluate tree widths. libtw is a Java library that provides a handy framework to experiment with tree width algorithms. I implemented a simple adapter to convert CDK molecule objects into the graph data structure used by libtw and a driver to process a SMILES file and report the tree width values as well as execution times. While libtw provides a number of tree width algorithms, I just used a single one (chosen arbitrarily). The code is available on Github and requires the CDK and libtw jar files to compile and run.
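To make the idea of a tree width upper bound concrete, here is a minimal Python sketch of the min-degree elimination heuristic. This is a stand-in for illustration only: it is not the libtw algorithm used for the results in this post, and the helper names (`treewidth_upper_bound`, `graph_from_bonds`) and the hand-written heavy-atom graphs are assumptions of the sketch.

```python
def treewidth_upper_bound(adjacency):
    """Upper bound on tree width via the min-degree elimination heuristic.

    `adjacency` maps each vertex to the set of its neighbours.
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    width = 0
    while graph:
        # Eliminate the vertex with the fewest remaining neighbours.
        v = min(graph, key=lambda u: len(graph[u]))
        nbrs = graph.pop(v)
        width = max(width, len(nbrs))
        # Remove v from its neighbours and connect them to each other (fill-in).
        for a in nbrs:
            graph[a].discard(v)
            graph[a].update(nbrs - {a})
    return width


def graph_from_bonds(n_atoms, bonds):
    """Build an undirected heavy-atom graph from (atom, atom) index pairs."""
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


# Hand-written heavy-atom graphs (hypothetical inputs, no SMILES parsing).
butane = graph_from_bonds(4, [(0, 1), (1, 2), (2, 3)])
benzene = graph_from_bonds(6, [(i, (i + 1) % 6) for i in range(6)])
naphthalene = graph_from_bonds(
    10,
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
     (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)],
)

for name, mol in [("butane", butane), ("benzene", benzene), ("naphthalene", naphthalene)]:
    print(name, treewidth_upper_bound(mol))
```

With this heuristic the acyclic chain gets a bound of 1, while benzene and naphthalene both get 2, which gives a small-scale feel for why the descriptor can be degenerate across molecules of different size.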
I took a random sample of 10,000 molecules from ChEMBL (also in the Github repository) and evaluated the upper bound of the tree width for each molecule. In addition, I evaluated a few well known topological descriptors for comparison purposes. The four plots summarize the results.
The calculation is certainly very fast and, surprisingly, the execution time doesn’t seem to correlate with molecular size. Apparently, some relatively small molecules take the longest time – but even those are very fast. Unfortunately, the descriptor is indeed degenerate, as shown in the top right – a given tree width value shows up for both small and large molecules (the R^2 between number of bonds and tree width is 0.03). The histogram in the lower left indicates that 60% of the molecules had the same value of tree width. In other words, the tree width does not really differentiate between molecular structures (in terms of size or complexity). In contrast, if we consider the Wiener path index, which has been used extensively in QSAR models, primarily as a measure of branching, we see that it exhibits a much closer relation with molecular size. Other topological measures focusing more specifically on structural complexity, such as fragment complexity, show similar correlations with molecular size (and with each other).
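For comparison, the Wiener index is simply the sum of shortest-path distances over all pairs of heavy atoms. The sketch below computes it with a breadth-first search from each atom; the function name and the hand-written graphs are assumptions for illustration, not the CDK implementation used for the plots.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path (bond count) distances over all atom pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives topological distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    # Every unordered pair was counted once from each end.
    return total // 2


# Heavy-atom graphs written out by hand (hypothetical inputs).
n_butane = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
isobutane = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
n_pentane = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
print(wiener_index(n_pentane))  # 20
```

The value drops when the chain is branched (n-butane versus isobutane) and grows quickly with chain length (n-butane versus n-pentane), consistent with its use as a branching measure that also tracks molecular size.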
So in conclusion, I don’t think the tree width is a useful descriptor for modeling purposes.
Some time back, John Van Drie and I developed the Structure Activity Landscape Index (SALI), which is a way to quantify activity cliffs – pairs of compounds which are structurally very similar but have significantly different activities. In preparation for a talk on SALI at the Boston ACS, I was looking for SAR datasets that contained cliffs. It turns out that ChEMBL is a great resource for SAR data. And with the EBI providing database dumps, it’s very easy to query across the entire collection to find datasets of interest.
For the purposes of this talk, I wanted to see what the datasets looked like in terms of the presence (or absence) of cliffs. Given that the idea of an activity cliff is only sensible for ligand-receptor type interactions, I only considered compound sets associated with binding assays. Furthermore, I only considered those assays which involved human targets, had a confidence score greater than 8 and contained between 75 and 500 molecules. (If you have an Oracle installation of ChEMBL then this SQL snippet will get you the list of assays satisfying these constraints).
This gives us 31 assays, which we can now analyze. For the purposes of this note, I evaluated the CDK hashed fingerprints and used the standardized activities to generate the pairwise SALI values for each of the datasets (performing the appropriate log transformation of the activities when required). The matrices that represent the pairwise SALI values are plotted in the heatmap montage below (the ChEMBL assay ID is noted in each image) where black represents the minimum SALI value and white represents the maximum SALI value for that dataset. (See the original paper for more details on this representation.) Clearly, the “roughness” of the activity landscape differs from dataset to dataset.
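As a reminder of what goes into each matrix, the SALI value for a pair of compounds is the absolute activity difference divided by one minus their structural similarity. The following is a minimal sketch of the pairwise calculation, not the code used to generate the heatmaps: fingerprints are represented as plain sets of on-bit indices, activities are assumed to be on a log scale already, and the treatment of identical fingerprints (infinite SALI, or zero when the activities also match) is an assumption of this sketch.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali_matrix(fingerprints, activities):
    """Symmetric matrix of SALI(i, j) = |A_i - A_j| / (1 - sim(i, j))."""
    n = len(activities)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = tanimoto(fingerprints[i], fingerprints[j])
            diff = abs(activities[i] - activities[j])
            if sim == 1.0:
                # Assumption: identical fingerprints give an infinite value
                # unless the activities are identical too.
                value = math.inf if diff > 0 else 0.0
            else:
                value = diff / (1.0 - sim)
            matrix[i][j] = matrix[j][i] = value
    return matrix


# Toy data (made up for illustration): three compounds with pIC50-like values.
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
activities = [5.0, 8.0, 5.5]

sali = sali_matrix(fingerprints, activities)
for row in sali:
    print(row)
```

In the toy data, the first two compounds share half of their bits but differ by three log units, so that pair gets the largest SALI value (6.0), while the dissimilar pairs score much lower.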
At this point I haven’t looked in depth into each dataset to characterize the landscapes in more detail, but this is a quick summary of multiple datasets. (Note that a few datasets contain cliffs which are derived from stereoisomers and hence may not actually be real cliffs: the fingerprint sees stereoisomers as structurally identical, so even a small activity difference between them shows up as a cliff.)
An alternative and useful representation is to convert the SALI values for a dataset into an empirical cumulative distribution function to provide a more quantitative view of how cliffs are distributed within a landscape. I’ll leave those details for the talk.
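For readers unfamiliar with the term, an empirical cumulative distribution function just reports, for any threshold, the fraction of observed values at or below it. Here is a minimal sketch, assuming the pairwise SALI values for a dataset have been collected into a flat list; the function name and the toy values are hypothetical.

```python
import bisect


def ecdf(values):
    """Return F where F(x) is the fraction of `values` that are <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cumulative_fraction(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(ordered, x) / n

    return cumulative_fraction


# Toy SALI values for the unique compound pairs of a small dataset (made up).
pairwise_sali = [0.5, 2.5, 6.0, 1.0]
F = ecdf(pairwise_sali)

for threshold in (0.1, 1.0, 3.0, 6.0):
    print(threshold, F(threshold))
```

A landscape dominated by a few steep cliffs would show a curve that rises quickly and then has a long tail, whereas a uniformly rough landscape would rise more gradually.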
Version 1.0.5 of the CDK descriptor calculator is now available. This version updates the command line batch mode to allow one to calculate a specific set of descriptors (as opposed to all or, say, topological). The selected descriptors are specified using an XML file, which can be generated in the GUI mode – fire up the calculator in GUI mode, check the selected descriptors and then save the selection. You can then specify the selection file via the -s option.
Recently there have been two papers asking whether cheminformatics, or virtual screening in general, has really helped drug discovery in terms of lead discovery.
The first paper, from Muchmore et al., focuses on the utility of various cheminformatics tools in drug discovery. Their report is retrospective in nature: they note that while much research has been done in developing descriptors and predictors of various molecular properties (solubility, bioavailability etc.), it does not seem that this has contributed to increased productivity. They suggest three possible reasons for this:
- not enough time to judge the contributions of cheminformatics methods
- methods not being used properly
- methods themselves not being sufficiently accurate.
They then go on to consider how these reasons may apply to various cheminformatics methods and tools that are accessible to medicinal chemists. Examples range from molecular weight and ligand efficiency to solubility, similarity and bioisosteres. They use a 3-class scheme – known knowns, known unknowns and unknown unknowns – corresponding to methods whose underlying principles are understood and whose results can be robustly interpreted, methods for properties that we don’t know how to realistically evaluate (but which we may still attempt to evaluate, such as solubility) and methods for which we can get a numerical answer but whose meaning or validity is doubtful. Thus, for example, ligand binding energy calculations are placed in the “unknown unknown” category and similarity searches are placed in the “known unknown” category.
It’s definitely an interesting read, summarizing the utility of various cheminformatics techniques. It raises a number of interesting questions and issues. For example, a recurring issue is that many cheminformatics methods are ultimately subjective, even though the underlying implementation may be quantitative – “what is a good Tanimoto cutoff?” in similarity calculations would be a classic example. The downside of the article is that it does appear at times to be specific to practices at Abbott.
The second paper is by Schneider and is more prospective and general in nature and discusses some reasons as to why virtual screening has not played a more direct role in drug discovery projects. One of the key points that Schneider makes is that
appropriate “description of objects to suit the problem” might be the key to future success
In other words, it may be that molecular descriptors, while useful surrogates of physical reality, are probably not sufficient to get us to the next level. Schneider even states that “… the development of advanced virtual screening methods … is currently stagnated”. This statement is true in many ways, especially if one considers the statistical modeling side of virtual screening (i.e., QSAR). Many recent papers discuss slight modifications to well-known algorithms that invariably lead to an incremental improvement in accuracy. Schneider suggests that improvements in our understanding of the physics of the drug discovery problem – protein folding, allosteric effects, dynamics of complex formation, etc. – rather than a continued focus on static properties (logP etc.) will lead to advances. Another very valid point is that future developments will need to move away from the prediction or modeling of “… one to one interactions between a ligand and a single target …” and instead will need to consider “… many to many relationships …”. In other words, advances in virtual screening will address (or need to address) ligand non-specificity or promiscuity. Thus activity profiles, network models and polypharmacology will all be vital aspects of successful virtual screening.
I really like Schneider’s views on the future of virtual screening, even though they are rather general. I agree with his views on the stagnation of machine learning (QSAR) methods, but at the same time I’m reminded of a paper by Halevy et al., which highlights the fact that
simple models and a lot of data trump more elaborate models based on less data
Now, they are talking about natural language processing using trillion-word corpora. Not exactly the situation we face in drug discovery! But it does look like we’re slowly going in the direction of generating biological datasets of large size and of multiple types. A recent NIH RFP proposes this type of development. Coupled with well-established machine learning methods, this could lead to some very interesting developments. (Of course, even ‘simple’ properties such as solubility could benefit from a ‘large data’ scenario, as noted by Muchmore et al.)
Overall, two interesting papers looking at the state of the field from different views.
I’ve put out an updated version (1.0.1) of the CDK descriptor calculator that now supports drag ‘n drop of the input file – just drag an appropriate file onto the UI and the input file text field should be automatically populated. In addition, all file dialogs let OS X users specify a file name manually.
The current version also supports a frequently requested command line batch mode. It’s a little limited compared to the GUI, since you can’t specify individual descriptors, only descriptor categories (such as ‘all’, ‘topological’ etc.), and the only output format is tab delimited.
$ java -jar CDKDescUI.jar -h
usage: cdkdescui [OPTIONS] inputfile
-b Batch mode
-o Output file
-t Descriptor type: all, topological, geometric, constitutional,
-v Verbose output
CDKDescUI v1.0.1 Rajarshi Guha <firstname.lastname@example.org>
By default, output is written to output.txt and all descriptors are evaluated. If errors occur for a given molecule and descriptor, they are reported at the end (i.e., the program continues rather than stopping at the first error).