Archive for the ‘isomorphism’ tag
Recently Syed Asad Rahman of the EBI sent around a message regarding some new code that he has been developing to perform maximum common substructure detection. It employs the CDK but also includes some alternative methods, which allow it to handle some cases for which the CDK takes forever. An example is determining the MCSS between the two molecules below. The new code from Syed detects the MCSS (though it takes a few seconds) whereas the CDK does not return after 1 minute.
While the sources haven’t been released yet, he has made a JAR file available along with some example code. It’s still in beta so there are some rough edges to the API as well as some edge cases that haven’t been caught. But overall, it seems to do a pretty good job for all the test cases I’ve thrown at it. To put it through its paces I put together a simple GUI to allow visual inspection of the resultant MCSS’s. It’s large because I bundled the entire CDK with it. Also, the substructure rendering is a little crude but does the job.
Sometime back I described how I was porting the VFLib algorithms to Java, so that we could use it for substructure search, since the current UniversalIsomorphismTester is pretty slow for this task, in general. While I had translated the Ullman algorithm implementation of VFLib and shown that it outperformed the CDK method, it turned out that didn’t work for certain cases such as finding CCC in C1CC1. This was due to a different definition of isomorphism that VFLib used. Instead, I tried to convert the VF2 implementation to Java. The motivation was that it does indeed perform substructure matching as is usually understood in cheminformatics, and was also reported to be extremely fast. Unfortunately, my translation was buggy and I put it on hold.
The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.
The Ullman algorithm is a well known method to perform subgraph isomorphism and even though more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.