Archive for the ‘performance’ tag
A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows) to a partitioned version (~3M rows in each of 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.
I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/:
“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”
Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
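To make the "linear search" suggestion concrete, here is a minimal sketch of a brute-force nearest-neighbor scan over 12-dimensional vectors. The function names and the choice of Manhattan distance are assumptions for illustration, not taken from the Pub3D code.

```python
import heapq


def manhattan(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbors(query, points, k=5):
    """Brute-force scan: return the k smallest (distance, index) pairs.

    Every point is compared with the query, so the cost is linear in
    len(points) and no index structure is built or maintained.
    """
    scored = ((manhattan(query, p), i) for i, p in enumerate(points))
    return heapq.nsmallest(k, scored)
```

For a few thousand molecules a scan like this is cheap. Whether it beats an R-tree at 17M rows would have to be measured rather than assumed.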
Sometime back I described how I was porting the VFLib algorithms to Java, so that we could use them for substructure search, since the current UniversalIsomorphismTester is, in general, pretty slow for this task. While I had translated the Ullmann algorithm implementation of VFLib and shown that it outperformed the CDK method, it turned out that it didn’t work for certain cases such as finding CCC in C1CC1. This was due to a different definition of isomorphism that VFLib used. Instead, I tried to convert the VF2 implementation to Java. The motivation was that it does indeed perform substructure matching as is usually understood in cheminformatics, and was also reported to be extremely fast. Unfortunately, my translation was buggy and I put it on hold.
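One way a definition of isomorphism can reject CCC in C1CC1 is by requiring an induced subgraph match, where query atoms that are not bonded must map to target atoms that are also not bonded. The sketch below assumes that this is the stricter definition involved, which the post does not spell out. It uses a brute-force search over atom mappings, not the Ullmann or VF2 algorithm, and ignores atom and bond types.

```python
from itertools import permutations


def matches(query_edges, n_query, target_edges, n_target, induced):
    """Return True if the query graph can be mapped into the target graph.

    induced=False: every query bond must exist in the target (the usual
    cheminformatics notion of a substructure match).
    induced=True: additionally, target bonds between mapped atoms must
    also be present in the query.
    """
    q = {frozenset(e) for e in query_edges}
    t = {frozenset(e) for e in target_edges}
    for mapping in permutations(range(n_target), n_query):
        ok = True
        for i in range(n_query):
            for j in range(i + 1, n_query):
                q_bond = frozenset((i, j)) in q
                t_bond = frozenset((mapping[i], mapping[j])) in t
                if q_bond and not t_bond:
                    ok = False
                if induced and t_bond and not q_bond:
                    ok = False
        if ok:
            return True
    return False


propane = [(0, 1), (1, 2)]               # CCC
cyclopropane = [(0, 1), (1, 2), (0, 2)]  # C1CC1

substructure_hit = matches(propane, 3, cyclopropane, 3, induced=False)
induced_hit = matches(propane, 3, cyclopropane, 3, induced=True)
```

Here `substructure_hit` is True and `induced_hit` is False: in the ring the two terminal carbons of CCC land on bonded atoms, which the induced definition forbids.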
Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times, to multiple databases and significantly faster multi-threaded queries.
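The general shape of a partitioned, multi-threaded query can be sketched as follows. In-memory lists stand in for the separate databases, and `query_partition` and `parallel_query` are hypothetical names. The actual Pub3D setup queries Postgres and is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def query_partition(partition, query, k):
    """Stand-in for one per-database query: top-k rows by Manhattan distance.

    partition is a list of (compound_id, vector) tuples.
    """
    scored = (
        (sum(abs(a - b) for a, b in zip(query, vec)), cid)
        for cid, vec in partition
    )
    return heapq.nsmallest(k, scored)


def parallel_query(partitions, query, k=5):
    """Query every partition in its own thread, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda p: query_partition(p, query, k), partitions)
        # Each partition returns its own top k, so the global top k is
        # guaranteed to be among the merged candidates.
        return heapq.nsmallest(k, (hit for part in partials for hit in part))
```

With real database connections the threads spend most of their time waiting on the server, which is where the speedup would come from. With pure-Python work like this stand-in, the threads would not run the distance calculations concurrently.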
The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However, it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.
The Ullmann algorithm is a well-known method to perform subgraph isomorphism and, even though it is more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.
In my last post I had reported some timing measurements for various operations. One of them was fingerprinting using the path-based hashing Fingerprinter class in the CDK. As reported, it took nearly 4 minutes to process a 1000-molecule subset of ZINC. Not good.
So I spent a little time last night hacking on the code, primarily making the search for unique paths a little faster. Happily, my latest commit (in 1.2.x, should be merged into trunk soon) allows the fingerprinter to process 1000 molecules in approximately 59 s – a 4X speedup.
In terms of behavior, the new code finds exactly the same paths as the old code, the only difference being that the order of atoms in a path can be reversed. Since the fingerprint is generated by hashing “path strings”, this means that the fingerprints from the new code will differ slightly from those of the old code. So if you’re working with a bunch of fingerprints calculated with the old code, you should probably regenerate them with the new code.
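Why a reversed atom order changes the fingerprint can be seen in a small sketch. The hashing scheme here (CRC32 of a dash-joined path string, folded into 1024 bits) is a stand-in chosen for illustration and is not the CDK Fingerprinter's actual hash. `canonical_path_hash` shows one hypothetical way to make the result independent of direction.

```python
import zlib


def path_hash(path, n_bits=1024):
    """Map a path (list of atom symbols) to a bit position via its string form."""
    return zlib.crc32("-".join(path).encode("ascii")) % n_bits


def canonical_path_hash(path, n_bits=1024):
    """Hash whichever of the two traversal directions sorts first.

    Both directions of the same path then set the same bit.
    """
    forward = list(path)
    backward = forward[::-1]
    return path_hash(min(forward, backward), n_bits)


path = ["C", "C", "O", "N"]
forward_string = "-".join(path)          # "C-C-O-N"
backward_string = "-".join(path[::-1])   # "N-O-C-C"
same_bit = canonical_path_hash(path) == canonical_path_hash(path[::-1])
```

The two strings differ, so a plain `path_hash` will generally set different bits for the two directions. That is why fingerprints from old and new code should not be mixed, unless both sides canonicalize the direction first.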