So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘mcss’ tag

Manipulating SMARTS Queries

with 3 comments

Yesterday, Andreas Maunz asked a question on the openbabel-discuss list:

… a possibility to combine two distinct smarts patterns. Of course, there is the comma OR operator (s1,s2), but I was thinking of a more sophisticated combination, where equal parts of s1 and s2 are “mapped onto each other” in a way such that merely the exclusive parts of s1 and s2 are connected via OR to the common core structure, where possible?

Fundamentally, a solution requires us to be able to manipulate and transform the SMARTS query in various ways. Operations of these types have been been discussed by Roger Sayle in a MUG presentation and these are available in the OpenEye tools. As far as I know, there are no open source implementations of such SMARTS optimizations (or manipulations). Later on in Andres’ thread, Andrew Dalke suggested that one could attack this by performing an MCSS calculation on two graphs representing the two queries, along with some appropriate normalizations.

It turns out that this type of manipulation isn’t too difficult in the CDK, thanks to Dazhi Jiao’s work on the JJTree based SMARTS parser. Since the parser generates an abstract syntax tree, the SmartQueryVisitor is able to walk the tree and generate an IQueryAtomContainer. Basically this is analogous to a molecule (i.e., IAtomContainer) and hence we can apply pre-exisiting code for substructure searches. Thus, to find the largest common query fragment between two SMARTS queries we can do:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
import org.openscience.cdk.smiles.smarts.parser.SMARTSParser;
import org.openscience.cdk.smiles.smarts.parser.ASTStart;
import org.openscience.cdk.smiles.smarts.parser.ParseException;
import org.openscience.cdk.smiles.smarts.parser.visitor.SmartsQueryVisitor;
import org.openscience.cdk.isomorphism.matchers.IQueryAtomContainer;
import org.openscience.cdk.isomorphism.UniversalIsomorphismTester;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IBond;
import org.openscience.cdk.exception.CDKException;

import java.util.List;

public class SMARTSManipulator {

    public SMARTSManipulator() {
    }

    private IQueryAtomContainer getQueryMolecule(String smarts) throws ParseException {
        SMARTSParser parser = new SMARTSParser(new java.io.StringReader(smarts));
        ASTStart ast = parser.Start();
        SmartsQueryVisitor visitor = new SmartsQueryVisitor();
        return (IQueryAtomContainer) visitor.visit(ast, null);
    }

    // helper to convert the IQueryAtomContainer to an IAtomContainer, to make the UIT
    // happy. This is OK, since the atoms and bonds in the IAtomContainer are still the query
    // atoms and query bonds
    private IAtomContainer convertToAtomContainer(IQueryAtomContainer queryContainer) {
        IAtomContainer x = queryContainer.getBuilder().newAtomContainer();
        for (IAtom atom : queryContainer.atoms()) x.addAtom(atom);
        for (IBond bond : queryContainer.bonds()) x.addBond(bond);
        return x;
    }

    public void smartsMCSS() throws ParseException, CDKException {

        IQueryAtomContainer queryTarget = getQueryMolecule("c1ccccc1C(=O)C*");
        IQueryAtomContainer queryQuery = getQueryMolecule("*C([N,S])C=O");


        boolean isSubgraph = UniversalIsomorphismTester.isSubgraph(convertToAtomContainer(queryTarget), queryQuery);
        System.out.println("isSubgraph = " + isSubgraph);
        System.out.println("");

        List<IAtomContainer> mcss = UniversalIsomorphismTester.getOverlaps(
                convertToAtomContainer(queryTarget),
                convertToAtomContainer(queryQuery));

        System.out.println("mcss.size() = " + mcss.size() + "\n");

        for (IAtomContainer aMcs : mcss) {
            System.out.println("aMcs.getAtomCount() = " + aMcs.getAtomCount());
            for (IAtom matchingAtom : aMcs.atoms())
                System.out.println("matchingAtom = " + matchingAtom);
            System.out.println("");
        }
    }

    public static void main(String[] args) throws ParseException, CDKException {
        SMARTSManipulator m = new SMARTSManipulator();
        m.smartsMCSS();
    }
}

Note that we convert the IQueryAtomContainer to IAtomContainer, since the UniversalIsomorphism tester expects that we are performing queries against an actual (i.e., real) molecule.

As expected the smaller query is not a subgraph of the larger query. But when we evaluate the MCSS, we do end up with a query fragment consisting of *CC=O.

Now this solution is just a quick hack to see if it could be done – it will fail on more general queries (such as C[#6]N and CCN) since we don’t perform any normalization. Furthermore, there are more sophisticated optimizations that could be done on the AST directly but that’s for another night

Written by Rajarshi Guha

March 7th, 2009 at 7:10 am

Faster Maximum Common Substructures

with 15 comments

Recently Syed Asad Rahman of the EBI sent around a message regarding some new code that he has been developing to perform maximum common substructure detection. It employs the CDK but also includes some alternative methods, which allow it to handle some cases for which the CDK takes forever. An example is determining the MCSS between the two molecules below. The new code from Syed detects the MCSS (though it takes a few seconds) whereas the CDK does not return after 1 minute.

1
2
c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CN(CC)CC
c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CNCCNC

While the sources haven’t been released yet, he has made a JAR file available along with some example code. It’s still in beta so there are some rough edges to the API as well as some edge cases that haven’t been caught. But overall, it seems to do a pretty good job for all the test cases I’ve thrown at it. To put it through its paces I put together a simple GUI to allow visual inspection of the resultant MCSS’s. It’s large because I bundled the entire CDK with it. Also, the substructure rendering is a little crude but does the job.

Written by Rajarshi Guha

February 19th, 2009 at 8:29 pm

Posted in cheminformatics,software

Tagged with , , , ,