In my last post I reported some timing measurements for various operations. One of them was fingerprinting using the path-based, hashed Fingerprinter class in the CDK. As reported, it took nearly 4 minutes to process a 1000-molecule subset of ZINC. Not good.
So I spent a little time last night hacking on the code, primarily making the search for unique paths a little faster. Happily, my latest commit (in 1.2.x, should be merged into trunk soon) allows the fingerprinter to process the same 1000 molecules in approximately 59s, a 4X speedup.
In terms of behavior, the new code finds exactly the same paths as the old code. The only difference is that the order of atoms in a path can be reversed. Since the fingerprint is generated by hashing “path strings”, a reversed path can hash to a different bit, so fingerprints from the new code will differ slightly from those produced by the old code. So if you’re working with a bunch of fingerprints calculated with the old code, you should probably regenerate them with the new code.
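To see why a reversed atom order matters, here is a minimal sketch of the general idea. It is not the CDK code: the `PathHashDemo` class, the dash-separated path strings, the use of `String.hashCode()` and the 1024-bit fingerprint length are all assumptions made up for illustration. The point is just that the same path written in the two directions is a different string, and so will generally set a different bit.

```java
import java.util.BitSet;

// Toy illustration only; NOT the CDK Fingerprinter implementation.
public class PathHashDemo {

    // Reverse the atom order of a dash-separated path string (hypothetical format).
    static String reversePath(String path) {
        String[] atoms = path.split("-");
        StringBuilder sb = new StringBuilder();
        for (int i = atoms.length - 1; i >= 0; i--) {
            sb.append(atoms[i]);
            if (i > 0) sb.append("-");
        }
        return sb.toString();
    }

    // Map a path string to a bit position; String.hashCode() stands in
    // for whatever hash function a real fingerprinter uses.
    static int bitFor(String path, int size) {
        return Math.floorMod(path.hashCode(), size);
    }

    public static void main(String[] args) {
        int size = 1024;                      // assumed fingerprint length
        String forward = "C-C-O";             // a path as the old code might write it
        String backward = reversePath(forward); // the same path, atoms reversed

        BitSet oldFp = new BitSet(size);
        BitSet newFp = new BitSet(size);
        oldFp.set(bitFor(forward, size));
        newFp.set(bitFor(backward, size));

        System.out.println(forward + " sets bit " + bitFor(forward, size));
        System.out.println(backward + " sets bit " + bitFor(backward, size));
        System.out.println("fingerprints identical? " + oldFp.equals(newFp));
    }
}
```

Both fingerprints encode the same substructure, but a bitwise comparison between an old and a new fingerprint would treat them as different, which is why mixing the two sets is a bad idea.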