What Has Cheminformatics Done for You Lately?

Recently there have been two papers asking whether cheminformatics, or virtual screening in general, has really helped drug discovery, specifically in terms of lead discovery.

The first paper, from Muchmore et al, focuses on the utility of various cheminformatics tools in drug discovery. Their report is retrospective in nature: they note that while much research has gone into developing descriptors and predictors of various molecular properties (solubility, bioavailability and so on), this work does not seem to have translated into increased productivity. They suggest three possible reasons for this:

  • not enough time has passed to judge the contributions of cheminformatics methods
  • the methods are not being used properly
  • the methods themselves are not sufficiently accurate

They then go on to consider how these reasons may apply to various cheminformatics methods and tools that are accessible to medicinal chemists. Examples range from molecular weight and ligand efficiency to solubility, similarity and bioisosteres. They use a three-class scheme – known knowns, known unknowns and unknown unknowns – corresponding, respectively, to methods whose underlying principles are understood and whose results can be robustly interpreted; methods for properties that we don’t yet know how to evaluate realistically, but may eventually be able to (such as solubility); and methods for which we can get a numerical answer but whose meaning or validity is doubtful. Thus, for example, ligand binding energy calculations are placed in the “unknown unknown” category and similarity searches in the “known unknown” category.
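To make the ligand efficiency example concrete: one common formulation (not necessarily the exact one Muchmore et al use) is LE = 1.37 × pIC50 / Nheavy, in kcal/mol per heavy atom, which follows from ΔG ≈ −RT ln(10) × pIC50 at room temperature if the IC50 is taken as a surrogate for the binding constant. So a 10 nM compound (pIC50 = 8) with 25 heavy atoms has LE ≈ 1.37 × 8 / 25 ≈ 0.44 kcal/mol per heavy atom.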

It’s definitely an interesting read, summarizing the utility of various cheminformatics techniques, and it raises a number of interesting questions and issues. For example, a recurring issue is that many cheminformatics methods are ultimately subjective, even though the underlying implementation may be quantitative – “what is a good Tanimoto cutoff?” in similarity calculations is a classic example. The downside of the article is that it does at times appear to be specific to practices at Abbott.
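For concreteness, here’s a minimal sketch of such a similarity calculation using the CDK (just a sketch – the method names are from recent CDK releases and may differ in the version you have):

import java.util.BitSet;

import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.fingerprint.Fingerprinter;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.similarity.Tanimoto;
import org.openscience.cdk.smiles.SmilesParser;

public class TanimotoExample {
    public static void main(String[] args) throws Exception {
        SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        // aspirin and salicylic acid - structurally related molecules
        IAtomContainer mol1 = sp.parseSmiles("CC(=O)Oc1ccccc1C(=O)O");
        IAtomContainer mol2 = sp.parseSmiles("OC(=O)c1ccccc1O");

        // path-based hashed fingerprints
        Fingerprinter fingerprinter = new Fingerprinter();
        BitSet fp1 = fingerprinter.getBitFingerprint(mol1).asBitSet();
        BitSet fp2 = fingerprinter.getBitFingerprint(mol2).asBitSet();

        System.out.println("Tanimoto = " + Tanimoto.calculate(fp1, fp2));
    }
}

Getting the number is the easy part; deciding whether that number means “similar enough” for the task at hand is where the subjectivity comes in.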

The second paper, by Schneider, is more prospective and general in nature, and discusses some reasons why virtual screening has not played a more direct role in drug discovery projects. One of the key points that Schneider makes is that

appropriate “description of objects to suit the problem” might be the key to future success

In other words, while molecular descriptors are useful surrogates of physical reality, they are probably not sufficient to get us to the next level. Schneider even states that “… the development of advanced virtual screening methods … is currently stagnated”. This statement rings true in many ways, especially on the statistical modeling side of virtual screening (i.e., QSAR), where many recent papers describe slight modifications to well known algorithms that invariably lead to incremental improvements in accuracy. Schneider suggests that advances will come from improvements in our understanding of the physics of the drug discovery problem – protein folding, allosteric effects, the dynamics of complex formation and so on – rather than from a continued focus on static properties (logP etc.). Another very valid point is that future developments will need to move away from predicting or modeling “… one to one interactions between a ligand and a single target …” and instead consider “… many to many relationships …”. In other words, virtual screening will need to address ligand non-specificity, i.e., promiscuity. Activity profiles, network models and polypharmacology will thus all be vital aspects of successful virtual screening.
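To make the “many to many” idea a little more concrete, here’s a toy sketch that compares two ligands by their target activity profiles rather than by their structures – the ligand and target names are entirely invented, and real network models would of course layer potencies, target relationships and much more on top of this:

import java.util.*;

public class ProfileOverlap {
    public static void main(String[] args) {
        // hypothetical ligand -> target mappings; all names invented
        Map<String, Set<String>> profiles = new HashMap<String, Set<String>>();
        profiles.put("ligandA", new HashSet<String>(Arrays.asList("CDK2", "GSK3B", "EGFR")));
        profiles.put("ligandB", new HashSet<String>(Arrays.asList("CDK2", "GSK3B")));

        // a Tanimoto-style overlap, but over targets hit rather than
        // structural fingerprint bits
        Set<String> shared = new HashSet<String>(profiles.get("ligandA"));
        shared.retainAll(profiles.get("ligandB"));
        Set<String> union = new HashSet<String>(profiles.get("ligandA"));
        union.addAll(profiles.get("ligandB"));
        System.out.printf("profile overlap = %.2f%n",
                          (double) shared.size() / union.size());
    }
}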

I really like Schneider’s views on the future of virtual screening, even though they are rather general. I agree with his views on the stagnation of machine learning (QSAR) methods, but at the same time I’m reminded of a paper by Halevy et al, which highlights the fact that

simple models and a lot of data trump more elaborate models based on less data

Now, they were talking about natural language processing using trillion-word corpora – not exactly the situation we face in drug discovery! But it does look like we’re slowly moving in the direction of generating biological datasets of large size and of multiple types; a recent NIH RFP proposes this type of development. Coupled with well established machine learning methods, this could lead to some very interesting developments. (Of course, even ‘simple’ properties such as solubility could benefit from a ‘large data’ scenario, as noted by Muchmore et al.)

Overall, two interesting papers looking at the state of the field from different views.

CDKDescUI Updates – DnD & Batch Mode

I’ve put out an updated version (1.0.1) of the CDK descriptor calculator that now supports drag ’n drop of the input file – just drag an appropriate file onto the UI and the input file text field will be populated automatically. In addition, all file dialogs now let OS X users specify a file name manually.

The current version also supports a frequently requested command-line batch mode. It’s a little limited compared to the GUI, since you can’t specify individual descriptors, only descriptor categories (such as ‘all’, ‘topological’ etc.), and the only output format is tab-delimited.

$ java -jar CDKDescUI.jar -h

usage: cdkdescui [OPTIONS] inputfile
                 
 -b    Batch mode
 -h    Help
 -o    Output file
 -t    Descriptor type: all, topological, geometric, constitutional,
       electronic, hybrid
 -v    Verbose output

CDKDescUI v1.0.1 Rajarshi Guha <rajarshi.guha@gmail.com>

By default, output is dumped to output.txt and all descriptors are evaluated. If errors occur for a given molecule and descriptor, they are reported at the end (i.e., the program continues).
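So a batch run restricted to, say, the topological descriptors would look something like the following – the file names are just placeholders, and I’m assuming the flags combine as the usage line above suggests:

$ java -jar CDKDescUI.jar -b -t topological -o topo.txt molecules.smi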

Another ACS Done

Finally back home from another ACS National Meeting, this time in San Francisco. While the location is certainly an attraction, there were some pretty nice talks and symposia in the CINF division, such as Visualization of Chemical Data, Metabolomics and Materials Informatics. Credit for these (and all the other) symposia goes to the organizers, who put in a lot of effort to get an excellent line-up of speakers – as evidenced by packed rooms. This time I finally got around to visiting some of the other divisions – there were some excellent talks in MEDI. As in the past, there was a Blue Obelisk dinner, this time at La Briciola (a fantastic recommendation from Moses Hohman and the CDD crowd), where there was much good discussion. I got a Blue Obelisk Obelisk from PMR (Cameron Neylon and Alex Wade were also recipients this year).

CINF had some excellent receptions where I got to see old faces and make some new friends – many of whom I had previously known only through virtual exchanges via email or FriendFeed. Here’s a picture of me and Wendy Warr from one of the receptions.

With the meeting over and most of the follow-up done, I can take a bit of a break while the last few submissions for the Boston program trickle in. Then I get down to finalizing the program for the Fall meeting. This fall we have an excellent line-up of symposia, including “Data Intensive Drug Design”, “Semantic Chemistry and RDF” and “Structure Activity Landscapes”. At the Fall meeting I’ll also be chairing a COMP symposium titled “HPC on the Cheap”, where an excellent set of speakers will focus on technologies that let users access high-performance computing power at a fraction of the price of supercomputers – things like FPGAs, GPUs and distributed systems such as Hadoop. This is part of the “Scripting and Programming” series, so expect to see code on the slides!

I’d also like to let people know that in Boston, CINF will be running an experimental symposium consisting of several very short (5 or 8 minute) lightning talks. Unlike traditional ACS symposia, we’re going to open submissions sometime in July and close them about two to three weeks before the meeting itself. In other words, we’re going to be looking for recent and ongoing developments in chemical information and cheminformatics. The title and exact mechanics of this symposium – dates, submission and review process, talk lengths and slide counts – will be announced in the near future in various places. If you think the early ACS deadlines suck, consider submitting a short talk to this symposium.

Overall, an excellent meeting in San Francisco, and I’m already looking forward to Boston. But in the meantime, it’s time to get back to chewing on data and finishing up some papers, book chapters and talks.