Faster Maximum Common Substructures

Recently, Syed Asad Rahman of the EBI sent around a message about new code he has been developing for maximum common substructure (MCSS) detection. It builds on the CDK but also includes some alternative methods, which let it handle cases on which the CDK takes forever. An example is determining the MCSS between the two molecules below. Syed’s new code detects the MCSS (though it takes a few seconds), whereas the CDK had not returned after 1 minute.

c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CN(CC)CC
c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CNCCNC
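To make the problem concrete, here is a minimal brute-force sketch of a connected maximum common substructure search on just the two side chains that differ between the molecules above (`CN(CC)CC` versus `CNCCNC`). It is not the algorithm used by the CDK or by Syed's code; the function names and the graph encoding (element symbols plus index pairs, implicit hydrogens, bond orders ignored) are assumptions made for illustration. The search is exponential in the number of atoms, which hints at why naive approaches stall on drug-sized molecules.

```python
def neighbours(n_atoms, bonds):
    """Build an adjacency list from (i, j) atom-index pairs."""
    adjacency = [set() for _ in range(n_atoms)]
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency


def common_substructure(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return a largest connected atom mapping {index in A: index in B}."""
    adj_a = neighbours(len(atoms_a), bonds_a)
    adj_b = neighbours(len(atoms_b), bonds_b)
    best = {}

    def extend(mapping):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        used_b = set(mapping.values())
        for a in range(len(atoms_a)):
            if a in mapping:
                continue
            # Stay connected: after the first atom, grow only from mapped atoms.
            if mapping and not any(x in adj_a[a] for x in mapping):
                continue
            for b in range(len(atoms_b)):
                if b in used_b or atoms_a[a] != atoms_b[b]:
                    continue
                # A bond must exist in A exactly when it exists in B.
                if all((x in adj_a[a]) == (y in adj_b[b])
                       for x, y in mapping.items()):
                    mapping[a] = b
                    extend(mapping)
                    del mapping[a]

    extend({})
    return best


# Side chain of the first molecule: C-N(-C-C)-C-C
atoms_a = ["C", "N", "C", "C", "C", "C"]
bonds_a = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]
# Side chain of the second molecule: C-N-C-C-N-C
atoms_b = ["C", "N", "C", "C", "N", "C"]
bonds_b = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

mapping = common_substructure(atoms_a, bonds_a, atoms_b, bonds_b)
print(len(mapping))  # 4 atoms: a C-N-C-C chain
```

The branched nitrogen in the first chain has three carbon neighbours while each nitrogen in the second has at most two, so the shared fragment stops at a four-atom chain.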

While the sources haven’t been released yet, he has made a JAR file available along with some example code. It’s still in beta, so there are some rough edges to the API as well as some edge cases that haven’t been caught. But overall, it seems to do a pretty good job on all the test cases I’ve thrown at it. To put it through its paces I put together a simple GUI to allow visual inspection of the resultant MCSS’s. The download is large because I bundled the entire CDK with it. Also, the substructure rendering is a little crude but does the job.

15 thoughts on “Faster Maximum Common Substructures”

  1. How does it compare to the vflib implementation?

  2. Haven’t looked at that – I thought that the vflib code was oriented towards substructure detection rather than MCSS

  3. Yes, you might be right about that…

    (My captcha for today: “15-year-old” “corpse” …. mmmm)

  4. gilleain says:

    I’m wondering what better highlighting should look like (I’m assuming that you used the new jcp code! :)

    One great visual effect I saw in some commercial code at GCC was a selection that looked like a cross-section of a surface: basically a curved path around connected atoms and bonds.

    It could be crudely reproduced by drawing very fat lines and medium sized circles underneath like this:

    http://gilleain.tumblr.com/post/80100724

    which doesn’t look as good, but still…

  5. Yes, I was using the code from a few days ago. I’m probably not doing the highlighting properly – I kept a non-zero highlight radius since without it, the atoms in a substructure are not colored.

    The image you linked to is not too bad – one thing I’d like is to not have the circles on the atom positions.

    Also, if I do use a non-zero highlight radius, how does one get it to be drawn under the atom symbol?

  6. gilleain says:

    Well, it might have been our bugs, rather than you not doing it properly!

    The whole area of highlighting and selection needs to be updated, as I found out when getting this code to work.

    Circles are not mandatory. The code to do this is here:

    http://gist.github.com/68111

    although it might not work until I’ve checked in some changes. Real bleeding edge…

    The more I think about it, the more I like the idea you mentioned of chemical stylesheets. If you could say something like:

    selection.atomRadius = 0
    selection.bondWidth = 5
    selection.color = yellow

    and so on, it would be much easier.

    Oh, and as to the question of drawing it underneath the symbol, the order of the generators determines the order of drawing. Effectively, each layer of the diagram is made by one generator.
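The point about generator order can be sketched in a few lines. This is a toy model, not the CDK renderer API: the generator functions and the tuple-based "drawing elements" are hypothetical stand-ins, and the only thing shown is that elements are painted in the order the generators are listed, so later layers cover earlier ones.

```python
def selection_layer(atoms):
    """Hypothetical generator: one highlight disc per selected atom."""
    return [("highlight", atom["symbol"]) for atom in atoms if atom["selected"]]


def symbol_layer(atoms):
    """Hypothetical generator: one text label per atom."""
    return [("symbol", atom["symbol"]) for atom in atoms]


def render(atoms, generators):
    """Paint each generator's elements in list order; later ones end up on top."""
    painted = []
    for generate in generators:
        painted.extend(generate(atoms))
    return painted


atoms = [{"symbol": "N", "selected": True}, {"symbol": "C", "selected": False}]

# Highlight painted first, so it sits underneath the atom symbols.
under = render(atoms, [selection_layer, symbol_layer])
# Highlight painted last, so it covers the atom symbols.
over = render(atoms, [symbol_layer, selection_layer])
```

Under this model, getting the highlight beneath the atom symbol is just a matter of listing the selection generator before the symbol generator.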

  7. Aah, thanks for the code – it’s cute :) But I now see how I can basically have my own renderer, which is very useful (and nice design!)

    Regarding the idea of stylesheets, what you wrote is probably the same as going via the RendererModel. What’d be nice is to create a RendererModel based on such settings in an external ‘stylesheet’ document
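A minimal sketch of what reading such an external stylesheet might look like, using the three `selection.*` settings from the comment above. The `SelectionStyle` class, its default values, and `parse_stylesheet` are hypothetical names invented for this example; nothing here is part of the CDK or its RendererModel.

```python
from dataclasses import dataclass


@dataclass
class SelectionStyle:
    """Hypothetical settings holder; the defaults are arbitrary placeholders."""
    atom_radius: float = 4.0
    bond_width: float = 1.0
    color: str = "red"


# Stylesheet key -> (attribute name, converter)
FIELDS = {
    "atomRadius": ("atom_radius", float),
    "bondWidth": ("bond_width", float),
    "color": ("color", str),
}


def parse_stylesheet(text):
    """Parse `selection.key = value` lines into a SelectionStyle."""
    style = SelectionStyle()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        scope, _, name = key.strip().partition(".")
        if scope != "selection" or name not in FIELDS:
            raise ValueError(f"unknown setting: {key.strip()}")
        attribute, convert = FIELDS[name]
        setattr(style, attribute, convert(value.strip()))
    return style


style = parse_stylesheet("""
selection.atomRadius = 0
selection.bondWidth = 5
selection.color = yellow
""")
print(style)
```

Rejecting unknown keys, rather than silently ignoring them, is a design choice here so that a typo in the stylesheet surfaces immediately.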

  8. Andrew Dalke says:

    In case you’re curious, I tried these structures using OEChem. It found both MCSes in 1.2 seconds on my 2yo MacBook Pro laptop, so a time of “a few seconds” is respectable.

    For your visualization, do you align the MCSes so they are the same? That would help see the matches.

  9. No, they aren’t aligned. I don’t think the CDK has that capability for 2D, but it should be a simple extension of the 3D code

  10. bekir says:

    Hi,

    I am new to chemoinformatics and I want to write some software which uses chemical fingerprints. Usually I use C++ but I couldn’t find a development kit as good as CDK for C++, so I will also try to learn some Java.

    What I would like to do is find common fingerprints (PubChem) in a set of about 50 compounds (drugs) and see these fingerprints highlighted on the 2D thumbnail views of the compounds at the same time. I am hoping to use this to find key common substructures in a set of compounds.

    So, is it possible to do this by using CDK? If so, which classes and functions should I use?

    Thanks in advance,
    Bekir

    PS I don’t know if this is the right place for asking this question but I couldn’t find any place that I can post a question on CDK’s website. So please excuse me for being irrelevant, if I am…

  11. Hi Bekir, yes generating fingerprints is relatively easy but then highlighting the fingerprints on the 2D structure may not be trivial. If you use hashed fingerprints, this is not possible at all. If you use the structural key FP’s then it is possible, but not directly since I don’t think you can access the SMARTS via the relevant fingerprint classes.

    It’d be best to ask this on the cdk-user mailing list
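The difference between the two fingerprint types can be shown with a toy example. Everything here is an assumption for illustration: fragments are plain strings rather than real substructure matches, the hash is CRC32 folded onto four bits, and `KEY_PATTERNS` is a made-up three-bit key, not the PubChem key set or any CDK class.

```python
import zlib


def hashed_bits(fragments, n_bits):
    """Fold each fragment string onto one of n_bits positions.

    A real hashed fingerprint keeps only the bits; the fragments are
    recorded here purely to make the collisions visible.
    """
    by_bit = {}
    for fragment in fragments:
        bit = zlib.crc32(fragment.encode()) % n_bits
        by_bit.setdefault(bit, set()).add(fragment)
    return by_bit


# Made-up structural key: bit i always means KEY_PATTERNS[i].
KEY_PATTERNS = ["Cl", "C(=O)N", "c1ccccc1"]


def keyed_bits(fragments):
    """Set bit i when the i-th key pattern is among the fragments."""
    return {i for i, pattern in enumerate(KEY_PATTERNS) if pattern in fragments}


fragments = {"Cl", "C(=O)N", "c1ccccc1", "CN", "CC", "N1CCNCC1"}

# Six fragments cannot fit into four bits without sharing, so at least
# one bit stands for several fragments and cannot be traced back to one.
by_bit = hashed_bits(fragments, n_bits=4)
collisions = {bit: frags for bit, frags in by_bit.items() if len(frags) > 1}

# With a structural key, each set bit maps straight back to its pattern.
recovered = [KEY_PATTERNS[i] for i in sorted(keyed_bits(fragments))]
```

With a structural key the recovered patterns could then be matched against the molecule to find the atoms to highlight, which is the step that needs access to the underlying SMARTS.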

  12. bekir says:

    Thank you so much.

  13. bekir says:

    Hi Rajarshi,
    I gave up on fingerprints and decided to use MCS. I tried your simple GUI for SMSD and it works like a charm. I would like to know if there is any way to get the calculated common substructure as output, for example in SMILES format? And is this GUI open source?

  14. Hi bekir, the source code is at http://tinyurl.com/l3xtkw – however it will require a modified version of the CDK for the depiction (as well as the SMSD jar)

  15. Asad says:

    SMSD paper— Small Molecule Subgraph Detector (SMSD) toolkit http://www.jcheminf.com/content/1/1/12

    Raj thanks for highlighting the tool. Please try the latest build.
