Archive for the ‘substructure’ tag
Sometime back I described how I was porting the VFLib algorithms to Java, so that we could use it for substructure search, since the current UniversalIsomorphismTester is pretty slow for this task, in general. While I had translated the Ullman algorithm implementation of VFLib and shown that it outperformed the CDK method, it turned out that didn’t work for certain cases such as finding CCC in C1CC1. This was due to a different definition of isomorphism that VFLib used. Instead, I tried to convert the VF2 implementation to Java. The motivation was that it does indeed perform substructure matching as is usually understood in cheminformatics, and was also reported to be extremely fast. Unfortunately, my translation was buggy and I put it on hold.
The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.
The Ullman algorithm is a well known method to perform subgraph isomorphism and even though more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.