Faster Substructure Search in the CDK

The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.

The Ullman algorithm is a well known method to perform subgraph isomorphism and even though more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.

The results of some initial performance tests are listed below. I timed (using System.currentTimeMillis) a 1000 runs of the substructure search using each of the methods. This test only considered substructure searching using connectivity, ignoring atom and bond characteristics. A

Target Query Ullman Time (ms) UIT Time (ms)
C1CCCCC1C(C)(C)(C) CC(C)(C)(C) 19 653
C1CCCCC1C(C)(C)(C) 38 1599
C1CCCCC1C2CCCCC2 C(C)(C)C(C)(C) 24 950
C1CCCCC1C2CCCCC2 49 4601
C1CCC2(CC1)OCCO2 C1CCCC1 20 1185
C1CCCC1(C)(C) 27 1825
C1CCCC1(CC)(CC) 35 2499

While the results look very impressive, the UIT code is finding all subgraph mappings, whereas the Ullman implementation is returning after finding the first one. For certain purposes this is useful (say for something like obgrep), but is obviously not a general solution. It’s also interesting to note that I’m not using the refinement step described by Ullman and implemented in VFLib,. With that I might be able to shave off a few more milliseconds.

Right now the code is not integrated into the CDK – before that it’ll need more work to get all mappings found and extensive testing to check for correctness (at least it differentiates between cyclopropane and isobutane!).

8 thoughts on “Faster Substructure Search in the CDK

  1. Rich Apodaca says:

    Rajarshi, excellent stuff. Can the code be downloaded?

    Back in the day, I implemented Ullman in Octet (a Java framework for cheminformatics):

    http://octet.cvs.sourceforge.net/octet/octet/src/net/sf/octet/traversal/UllmanIsomorphismTraverser.java?revision=1.2&view=markup

    It uses matrices, but I always thought it would be cleaner and more understandable to use objects. I remember how fun it was to get the recursive backtrack method working right – and how amazed I was when it finally did work.

    This was based on code found in the Graph Visualization Framework:

    http://gvf.sf.net

  2. Further things affected will be database searches. I know this is obvious but there are currently a couple of database projects being equipped with a CDK-based substructure search and this will be a big step forward.
    I think we at EBI would like to be part of the testing team.

  3. Thierry Hanser says:

    Hello Rajarshi,

    It is great news to see an Ullmann implementation joining the CDK toolkit. I am not surprised that it is faster than the UIT tool since the Ullmann algorithm is specifically designed for substructure search.

    It is worth reminding that the UIT was designed for maximum common substructure search (MCSS) wich has a much larger search space and is actually a quite different problem. Substructure search and isomorphism search were only by-products of the UIT algorithm. Having said that I appreciate there is still a lot of room for improvement of the UIT algorithm or its implementation; I think we could at least win a factor 2 without out too much trouble (unfortunately there is no time available on my side for that :-( ). I am pretty sure there should although be a way to “inject” the Ullmann relaxation technic inside the UIT algorithm to improve significantly it’s performance.

    Now that CDK benefits from a specialized substructure search algorithm it might be useful to redirect to the correct algorithm according to the type of search so that the en user does not have to worry about the underlying algorithms when using the high-level API (for instance the UIT API could be kept using the RGraph based algorithm only for MCSS search and the Ullmann algorithm otherwise). This could be a way to provide backwards compatibility

    Thanks Rajarshi for this useful contribution that CDK needed.

    Thierry

  4. Thierry – thanks for the observations. I completely agree and the Ullman implenentation is not intended to replace the current implementation.

    I was also thinking on the lines of your note – update the code so that your code is just applied for MCSS and general substructure searches are done by Ullman.

  5. Christoph, re database searches yes. After I get it cleaned up it’ll go into a branch and then you’d be free to play with it

  6. Rich, sorry for the late reply – your comment had gone to spam and I hadn’t looked at it in some time.

    The code isn’t available for download at this point (though I can send it if you’d like), since it suffers from a problem. VFLib differentiates between subgraph isomorphism and monomorphism. As a result their Ullman implementation says that CCC is *not* a substructure of C1CC1.

    I’m working on translating the VF2 monomorphism algorithm which does work as expected. Use of pointers don’t help the translation process :(

  7. […] 18, 2008 by Rajarshi Guha Sometime back I described how I was porting the VFLib algorithms to Java so that we could use it for substructure search, since the current […]

  8. […] since that would imply NP=coNP. The graph isomorphism problem does not yet have a polynomial …Faster Substructure Search in the CDK at So much to do, so …The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s […]

Leave a Reply to Thierry Hanser Cancel reply

Your email address will not be published. Required fields are marked *