Yesterday, Andreas Maunz asked a question on the openbabel-discuss list:
… a possibility to combine two distinct smarts patterns. Of course, there is the comma OR operator (s1,s2), but I was thinking of a more sophisticated combination, where equal parts of s1 and s2 are “mapped onto each other” in a way such that merely the exclusive parts of s1 and s2 are connected via OR to the common core structure, where possible?
Fundamentally, a solution requires us to be able to manipulate and transform the SMARTS query in various ways. Operations of these types have been been discussed by Roger Sayle in a MUG presentation and these are available in the OpenEye tools. As far as I know, there are no open source implementations of such SMARTS optimizations (or manipulations). Later on in Andres’ thread, Andrew Dalke suggested that one could attack this by performing an MCSS calculation on two graphs representing the two queries, along with some appropriate normalizations.
It turns out that this type of manipulation isn’t too difficult in the CDK, thanks to Dazhi Jiao’s work on the JJTree based SMARTS parser. Since the parser generates an abstract syntax tree, the SmartQueryVisitor is able to walk the tree and generate an IQueryAtomContainer. Basically this is analogous to a molecule (i.e., IAtomContainer) and hence we can apply pre-exisiting code for substructure searches. Thus, to find the largest common query fragment between two SMARTS queries we can do:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | import org.openscience.cdk.smiles.smarts.parser.SMARTSParser; import org.openscience.cdk.smiles.smarts.parser.ASTStart; import org.openscience.cdk.smiles.smarts.parser.ParseException; import org.openscience.cdk.smiles.smarts.parser.visitor.SmartsQueryVisitor; import org.openscience.cdk.isomorphism.matchers.IQueryAtomContainer; import org.openscience.cdk.isomorphism.UniversalIsomorphismTester; import org.openscience.cdk.interfaces.IAtom; import org.openscience.cdk.interfaces.IAtomContainer; import org.openscience.cdk.interfaces.IBond; import org.openscience.cdk.exception.CDKException; import java.util.List; public class SMARTSManipulator { public SMARTSManipulator() { } private IQueryAtomContainer getQueryMolecule(String smarts) throws ParseException { SMARTSParser parser = new SMARTSParser(new java.io.StringReader(smarts)); ASTStart ast = parser.Start(); SmartsQueryVisitor visitor = new SmartsQueryVisitor(); return (IQueryAtomContainer) visitor.visit(ast, null); } // helper to convert the IQueryAtomContainer to an IAtomContainer, to make the UIT // happy. This is OK, since the atoms and bonds in the IAtomContainer are still the query // atoms and query bonds private IAtomContainer convertToAtomContainer(IQueryAtomContainer queryContainer) { IAtomContainer x = queryContainer.getBuilder().newAtomContainer(); for (IAtom atom : queryContainer.atoms()) x.addAtom(atom); for (IBond bond : queryContainer.bonds()) x.addBond(bond); return x; } public void smartsMCSS() throws ParseException, CDKException { IQueryAtomContainer queryTarget = getQueryMolecule("c1ccccc1C(=O)C*"); IQueryAtomContainer queryQuery = getQueryMolecule("*C([N,S])C=O"); boolean isSubgraph = UniversalIsomorphismTester.isSubgraph(convertToAtomContainer(queryTarget), queryQuery); System.out.println("isSubgraph = " + isSubgraph); System.out.println(""); List<IAtomContainer> mcss = UniversalIsomorphismTester.getOverlaps( convertToAtomContainer(queryTarget), convertToAtomContainer(queryQuery)); System.out.println("mcss.size() = " + mcss.size() + "\n"); for (IAtomContainer aMcs : mcss) { System.out.println("aMcs.getAtomCount() = " + aMcs.getAtomCount()); for (IAtom matchingAtom : aMcs.atoms()) System.out.println("matchingAtom = " + matchingAtom); System.out.println(""); } } public static void main(String[] args) throws ParseException, CDKException { SMARTSManipulator m = new SMARTSManipulator(); m.smartsMCSS(); } } |
Note that we convert the IQueryAtomContainer to IAtomContainer, since the UniversalIsomorphism tester expects that we are performing queries against an actual (i.e., real) molecule.
As expected the smaller query is not a subgraph of the larger query. But when we evaluate the MCSS, we do end up with a query fragment consisting of *CC=O.
Now this solution is just a quick hack to see if it could be done – it will fail on more general queries (such as C[#6]N and CCN) since we don’t perform any normalization. Furthermore, there are more sophisticated optimizations that could be done on the AST directly but that’s for another night
Great, thank you! 😉 I have not yet decided how to solve the problem.
Your suggestion seems a first step, applicable to general patterns.
However, I do control the way in which the SMARTS are generated, which might be useful (as AD pointed out in the discussion-thread). Also, my fragments are non-cyclic. Therefore, I might rely on simpler solutions. The common approach of MCS calculation, however, will remain.
Greetings, Andreas
Sounds interesting. Hope you keep us updated on your progress
You can now find my solution on cs.maunz.de as a paper called “Latent Structure Pattern Mining”. I exploit the canonical depth sequence formulation of the graph miner “Gaston”.
However, your above solution seems appealing for a viewer application based on CDK that we currently build.