As part of a larger project, I’ve been doing some profiling on various aspects of the CDK, focusing on core cheminformatics operations. I’m using the excellent YourKit profiler to do the tests. The tests are run on a MacBook Pro (2.16GHz) with 1GB RAM, using the latest trunk version of the CDK and JDK 1.5.
The test data is a 1000-molecule subset taken from the ZINC collection. The operations I’ve been looking at are
The test harness simply reads the 1000 molecules one by one and performs the operation in question. For certain tasks which are not atomic in nature, the code does a little more but the timing is measured only for the operation under study. In all cases, things like loading molecules from disk are not measured. The whole process is repeated 10 times and the times reported are the average of the 10 runs. A brief overview of the results:
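The timing scheme described above (time only the operation under study, repeat the whole pass 10 times, report the average) could be sketched roughly as follows. This is not the actual harness: the class name `TimingHarness`, the method `averageMillis` and the use of plain strings in place of CDK molecules are all assumptions made for illustration, and it uses Java 8 syntax rather than JDK 1.5.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a timing harness; names are illustrative only.
public class TimingHarness {

    /**
     * Applies op to every item, timing only the op call itself,
     * repeats the whole pass `runs` times and returns the mean
     * time per pass in milliseconds.
     */
    public static <T> double averageMillis(List<T> items, Consumer<T> op, int runs) {
        long totalNanos = 0L;
        for (int run = 0; run < runs; run++) {
            for (T item : items) {
                // Start the clock only around the operation under study,
                // so that loading or setup work is not counted.
                long start = System.nanoTime();
                op.accept(item);
                totalNanos += System.nanoTime() - start;
            }
        }
        // Convert nanoseconds to milliseconds, then average over the runs.
        return totalNanos / 1e6 / runs;
    }

    public static void main(String[] args) {
        // Stand-in data: in the real tests these would be molecules
        // that have already been read from disk.
        List<String> molecules = Arrays.asList("CCO", "c1ccccc1", "CC(=O)O");
        double avg = averageMillis(molecules, s -> s.toCharArray(), 10);
        System.out.println("average ms per pass: " + avg);
    }
}
```

Timing each call separately keeps per-molecule setup out of the measurement, at the cost of some clock overhead for very fast operations.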
Since we’re coming up to a 1.2 release (see Egon’s post), I’ve put up a nightly build site for the 1.2.x branch here so that we can track improvements in the JUnit tests and various other code and documentation quality issues.
I just updated the CDK Nightly build script so that it summarizes the state of unit test coverage. Currently, trunk has a total of 3215 methods (in 378 classes) that are missing unit tests. See the JUnit test summary for a module-wise summary.
I’m in academia and I do cheminformatics. Recent collaborations, papers and funding issues in this field have made me think about the future of this research in this setting. This, and a thread discussing David Leahy’s talk on InkSpot Science at the Soton Open Science Workshop, got me started on this post.
There are currently a number of groups and collaborations that are attempting to perform drug discovery without the large centralized infrastructure that is characteristic of this process. Examples include Jean-Claude Bradley, who runs the UsefulChem project, and the Synaptic Leap, as well as various academic labs. Also see Kozikowski et al.
Cheminformatics plays a key role in drug discovery efforts at various stages. Examples include identifying or prioritizing compounds from virtual libraries and predicting ADME profiles and side effects (e.g., hERG activation). I should stress that such computational methods don’t replace bench work, but they can certainly enhance it. More generally, we’re now faced with a deluge of data, and human eyeballs are not going to be able to handle this. And this is exactly where cheminformatics does its stuff.
Houghten, R. et al., “Strategies for the Use of Mixture-Based Synthetic Combinatorial Libraries: Scaffold Ranking, Direct Testing In Vivo, and Enhanced Deconvolution by Computational Methods”, J. Comb. Chem., 2008, 10, 3-19
Recently, a collaborator pointed me to the above article by Houghten and co-workers, where they describe the use of mixture-based combinatorial libraries for high-throughput screening (HTS) experiments.
Traditionally, an HTS experiment will screen thousands to millions of individual molecules. Obviously, it’s all done by robots, so although you have to be careful during setup, you don’t have to do it all by hand. But the fact is, if it’s possible to reduce the number of individual screens, life becomes easier and cheaper. Houghten et al. describe an elegant approach that does just this.