A while back I wrote about some updates I had made to the CDK fingerprinting code to improve performance. Recently Egon and Jonathan Alvarsson (Uppsala) had made even more improvements. Some of them are simple fixes (making a String[] final, using Set rather than List) while others are more significant (efficient caching of paths). In combination, they have improved performance by over 50%, compared to my last update. Egon has put up a nice summary of performance runs here. Excellent work guys!
Jonathan did all this work… I only set up a GDocs spreadsheet to show him how to share his electronic lab notebook
One big remaining bottleneck is deep inside the CDK classes: getConnectedAtomsList()… this is because it has to iterate over all bonds to find which atom is connected. That scale as O(n)… Instead, while a more memory consuming, I’d really like to see atoms have pointers to attached bonds… This adds a few pointers to the atom, 4 byte each(?), but converts the getConnectedAtomsList() scalability to O(4).
How do you attach pointers? What about restructuring and adding a hash table to each atom, listing connected atoms?
Adding a private, non-getsetable Map<IAtom,List> to AtomContainer would do the job… then we would not even need to change the interfaces…
But, we need first a good benchmark set…. a set of tasks any cheminformatics toolkit could do, by which we could compare performance…
Good idea – benchmark should be easy. Find high level classes that use getConnectedAtoms and benchmark them. Likely, many descriptors will use this function.
Egon and Rajarshi, good points about iterating over all bonds to find a connection. In MX, every atom maintains a List of its neighbors. For some reason, MX still iterates over all bonds to find a connection between two atoms with Molecule.getBond(Atom, Atom):
http://tinyurl.com/63t64v
Easy enough to fix, though.
BTW, Rajarshi, does your comments system allow markup such as ?