So much to do, so little time

Trying to squeeze sense out of chemical data

Faster Maximum Common Substructures


Recently Syed Asad Rahman of the EBI sent around a message about some new code he has been developing for maximum common substructure (MCSS) detection. It employs the CDK but also includes some alternative methods, which allow it to handle cases where the CDK takes forever. An example is determining the MCSS between the two molecules below. Syed's new code detects the MCSS (though it takes a few seconds), whereas the CDK had not returned after 1 minute.


While the sources haven’t been released yet, he has made a JAR file available along with some example code. It’s still in beta, so there are some rough edges to the API as well as some edge cases that haven’t been caught. But overall, it seems to do a pretty good job on all the test cases I’ve thrown at it. To put it through its paces I put together a simple GUI to allow visual inspection of the resulting MCSSs. The GUI is large because I bundled the entire CDK with it. Also, the substructure rendering is a little crude but does the job.

Written by Rajarshi Guha

February 19th, 2009 at 8:29 pm

Posted in software,cheminformatics


15 Responses to 'Faster Maximum Common Substructures'


  1. How does it compare to the vflib implementation?

    Egon Willighagen

    19 Feb 09 at 8:55 pm

  2. Haven’t looked at that – I thought that the vflib code was oriented towards substructure detection rather than MCSS.

    Rajarshi Guha

    19 Feb 09 at 9:01 pm

  3. Yes, you might be right about that…

    (My captcha for today: “15-year-old” “corpse” …. mmmm)

    Egon Willighagen

    19 Feb 09 at 9:48 pm

  4. I’m wondering what better highlighting should look like (I’m assuming that you used the new jcp code! :)

    One great visual effect I saw in some commercial code at GCC was a selection that looked like a cross-section of a surface: basically a curved path around the connected atoms and bonds.

    It could be crudely reproduced by drawing very fat lines and medium sized circles underneath like this:

    which doesn’t look as good, but still…


    21 Feb 09 at 12:34 am

  5. Yes, I was using the code from a few days ago. I’m probably not doing the highlighting properly – I kept a non-zero highlight radius since without it, the atoms in a substructure are not colored.

    The image you linked to is not too bad – one thing I’d like is to not have the circles on the atom positions.

    Also, if I do use a non-zero highlight radius, how does one get it to be drawn under the atom symbol?

    Rajarshi Guha

    21 Feb 09 at 4:51 pm

  6. Well, it might have been our bugs, rather than you not doing it properly!

    The whole area of highlighting and selection needs to be updated, as I found out when getting this code to work.

    Circles are not mandatory. The code to do this is here:

    although it might not work until I’ve checked in some changes. Real bleeding edge…

    The more I think about it, the more I like the idea you mentioned of chemical stylesheets. If you could say something like:

    selection.atomRadius = 0
    selection.bondWidth = 5
    selection.color = yellow

    and so on, it would be much easier.

    Oh, and as to the question of drawing it underneath the symbol, the order of the generators determines the order of drawing. Effectively, each layer of the diagram is made by one generator.


    21 Feb 09 at 6:20 pm

  7. Aah, thanks for the code – it’s cute :) But I now see how I can basically have my own renderer, which is very useful (and nice design!)

    Regarding the idea of stylesheets, what you wrote is probably the same as going via the RendererModel. What’d be nice is to create a RendererModel based on such settings in an external ‘stylesheet’ document.

    Rajarshi Guha

    21 Feb 09 at 7:25 pm

  8. In case you’re curious, I tried these structures using OEChem. It found both MCSes in 1.2 seconds on my 2yo MacBook Pro laptop, so a time of “a few seconds” is respectable.

    For your visualization, do you align the MCSes so they are the same? That would help see the matches.

    Andrew Dalke

    23 Feb 09 at 1:55 pm

  9. No, they aren’t aligned. I don’t think the CDK has that capability for 2D, but it should be a simple extension of the 3D code

    Rajarshi Guha

    23 Feb 09 at 6:45 pm

  10. Hi,

    I am new to chemoinformatics and I want to write some software that uses chemical fingerprints. Usually I use C++, but I couldn’t find a development kit as good as the CDK for C++, so I will also try to learn some Java.

    What I would like to do is find common fingerprints (PubChem) in a set of about 50 compounds (drugs) and see these fingerprints highlighted on the 2D thumbnail views of the compounds at the same time. I am hoping to use this to find key common substructures in a set of compounds.

    So, is it possible to do this using the CDK? If so, which classes and functions should I use?

    Thanks in advance,

    PS I don’t know if this is the right place for asking this question but I couldn’t find any place that I can post a question on CDK’s website. So please excuse me for being irrelevant, if I am…


    11 Jun 09 at 7:44 am

  11. Hi Bekir, yes, generating fingerprints is relatively easy, but highlighting the fingerprints on the 2D structure may not be trivial. If you use hashed fingerprints, it is not possible at all. If you use the structural key FPs then it is possible, but not directly, since I don’t think you can access the SMARTS via the relevant fingerprint classes.

    It’d be best to ask this on the cdk-user mailing list

    Rajarshi Guha

    17 Jun 09 at 2:05 am

  12. Thank you so much.


    17 Jun 09 at 2:40 pm

  13. Hi Rajarshi,
    I gave up on fingerprints and decided to use MCS. I tried your simple GUI for SMSD and it works like a charm. I would like to know if there is any way to get the calculated common substructure as output, for example in SMILES format? And is this GUI open source?


    14 Jul 09 at 1:37 pm

  14. Hi bekir, the source code is at – however it will require a modified version of the CDK for the depiction (as well as the SMSD jar)

    Rajarshi Guha

    15 Jul 09 at 2:20 pm

  15. SMSD paper: Small Molecule Subgraph Detector (SMSD) toolkit

    Raj thanks for highlighting the tool. Please try the latest build.


    13 Aug 09 at 11:20 pm
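
The "chemical stylesheet" idea from comments 6 and 7 could be sketched roughly as follows. This is a minimal illustration in plain Java, not CDK code: the `SelectionStyle` class and its `parse` method are hypothetical names, and the fallback defaults are assumptions. It reads the three `selection.*` settings from comment 6 with the standard `java.util.Properties` parser. Copying the values onto a real RendererModel is deliberately left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the selection settings quoted in comment 6.
// Nothing here is part of the CDK; it only shows how an external
// "stylesheet" document could be turned into typed values.
class SelectionStyle {
    final double atomRadius;
    final double bondWidth;
    final String color;

    SelectionStyle(double atomRadius, double bondWidth, String color) {
        this.atomRadius = atomRadius;
        this.bondWidth = bondWidth;
        this.color = color;
    }

    // Parse "key = value" lines. Missing keys fall back to assumed defaults.
    static SelectionStyle parse(String stylesheet) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(stylesheet));
        double radius = Double.parseDouble(props.getProperty("selection.atomRadius", "0"));
        double width = Double.parseDouble(props.getProperty("selection.bondWidth", "1"));
        String color = props.getProperty("selection.color", "yellow");
        return new SelectionStyle(radius, width, color);
    }

    public static void main(String[] args) throws IOException {
        String sheet = String.join("\n",
                "selection.atomRadius = 0",
                "selection.bondWidth = 5",
                "selection.color = yellow");
        SelectionStyle style = parse(sheet);
        System.out.println(style.atomRadius + " " + style.bondWidth + " " + style.color);
    }
}
```

A renderer could then read these values once at startup, which keeps the drawing code free of hard-coded highlight settings.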
