# So much to do, so little time

Trying to squeeze sense out of chemical data

## Another Oracle Structure Search Cartridge

without comments

I came across an ASAP paper today describing substructure searching in Oracle databases. The paper comes from the folks at J & J and is part of their series of papers on the ABCD platform. Performing substructure searches in databases is certainly not a new topic, and various products support this in Oracle (as well as other RDBMSs). The paper describes how the ABCD system does this using a combination of structure-derived hash keys and an inverted bitset-based index, and discusses their implementation as an Oracle cartridge. They provide an interesting discussion of how their implementation supports Cost Based Optimization of SQL queries involving substructure search. The authors run a number of benchmarks. In terms of comparative benchmarks, they compare the performance (i.e., screening efficiency) of their hashed keys versus MACCS keys, CACTVS keys and OpenBabel FP2 fingerprints. Their results indicate that the screening step is a key bottleneck in the query process and that their hash key is generally more selective than the others.
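The paper's hash keys themselves aren't available, but the screening idea they build on is standard: a candidate structure can only contain the query substructure if every bit set in the query's key is also set in the candidate's key. A minimal sketch of that superset test using `java.util.BitSet` (all names here are hypothetical illustrations, not the ABCD implementation):

```java
import java.util.BitSet;

// Sketch of the screening step used by substructure search systems:
// a candidate passes the screen only if its key is a superset of the
// query's key. Candidates that fail can be discarded without running
// the (expensive) graph isomorphism check.
public class ScreenSketch {
    // true if every bit set in query is also set in candidate
    static boolean passesScreen(BitSet query, BitSet candidate) {
        BitSet q = (BitSet) query.clone();
        q.andNot(candidate);   // bits present in query but absent in candidate
        return q.isEmpty();    // none left => candidate may contain the query
    }

    public static void main(String[] args) {
        BitSet query = new BitSet(16);
        query.set(1); query.set(5);

        BitSet hit = new BitSet(16);
        hit.set(1); hit.set(5); hit.set(9);   // superset of the query bits

        BitSet miss = new BitSet(16);
        miss.set(1); miss.set(9);             // missing bit 5

        System.out.println(passesScreen(query, hit));   // true
        System.out.println(passesScreen(query, miss));  // false
    }
}
```

A more selective key sets bits that fewer candidates share, so fewer false positives survive the screen and reach the isomorphism step, which is exactly the property the authors benchmark.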

Unfortunately, what would have been interesting but was not provided was a comparison of the performance at the Oracle query level with other products such as JChem Cartridge and OrChem. Furthermore, the test case is just under a million molecules from Golovin & Henrick – the entire dataset (not just the keys) could probably reside in-memory on today's servers. How does the system perform when faced with, say, PubChem (34 million molecules)? The paper mentions a command line implementation of their search procedure, but as far as I can tell, the Oracle cartridge is not available.

The ABCD system has many useful and interesting features. But as with the other publications on this system, this paper is one more in the line of “Papers About Systems You Can’t Use or Buy”. Unfortunate.

Written by Rajarshi Guha

November 10th, 2011 at 11:00 pm

Posted in Literature, cheminformatics


## Caching SMARTS Queries

with 3 comments

Andrew Dalke recently published a detailed write-up on his implementation of the PubChem fingerprints and provided a pretty thorough comparison with the CDK implementation. He pointed out a number of bugs in the CDK version, but he also noted that performance could be improved by caching parsed SMARTS queries, which are used extensively in this fingerprinter. So I wanted to see whether caching really helps.

The CDK SMARTS parser is a JJTree based implementation and my anecdotal evidence suggested that SMARTS parsing was not a real bottleneck. But of course, nothing beats measurements. So I modified the SMARTSQueryTool class to cache parsed SMARTS queries using a simple LRU cache (patch):

```java
final int MAX_ENTRIES = 20;
Map cache = new LinkedHashMap(MAX_ENTRIES + 1, .75F, true) {
    public boolean removeEldestEntry(Map.Entry eldest) {
        return size() > MAX_ENTRIES;
    }
};
```

The map is keyed on the SMARTS string. Then, when we attempt to parse a new SMARTS query, we check if it’s in the cache and if not, parse it and place it in the cache.

Now, the thing about this caching mechanism is that once 20 SMARTS queries have been cached, the least recently used entry is evicted to make room for each new one. As a result, if we match 40 unique SMARTS patterns in sequence against a given molecule, only the last 20 remain cached. When we move on to the next molecule, the first 20 patterns are no longer in the cache, so we get no benefit from the caching scheme. In general, if the fingerprinter (or any caller of the parser) performs N unique SMARTS queries per molecule, the cache must hold at least N entries for subsequent molecules to benefit.
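To make the cache-size argument concrete, here is a small stand-alone sketch (hypothetical names, plain `LinkedHashMap`, no CDK) that replays N unique pattern strings per molecule and counts cache hits:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Demonstrates why an LRU cache smaller than the number of unique
// patterns queried per molecule never produces a hit: by the time a
// pattern comes around again, it has already been evicted.
public class LruSketch {
    static int countHits(int cacheSize, int nPatterns, int nMolecules) {
        Map<String, Object> cache = new LinkedHashMap<String, Object>(cacheSize + 1, 0.75f, true) {
            public boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                return size() > cacheSize;
            }
        };
        int hits = 0;
        for (int mol = 0; mol < nMolecules; mol++) {
            for (int p = 0; p < nPatterns; p++) {
                String smarts = "pattern-" + p;        // stand-in for a SMARTS string
                if (cache.containsKey(smarts)) hits++;
                else cache.put(smarts, new Object());  // stand-in for the parsed query
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // cache smaller than the number of unique patterns: every lookup misses
        System.out.println(countHits(20, 40, 5));   // 0
        // cache at least as large: every lookup after the first molecule hits
        System.out.println(countHits(40, 40, 5));   // 160
    }
}
```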

I implemented a quick test harness, reading in 100 SMILES and then generating PubChem fingerprints for each molecule. The fingerprint generation was repeated 5 times and the time reported for each round. The box plot shows the distribution of the timings. Now, the CDK implementation uses 621 SMARTS patterns – as you can see, we only get a benefit from the caching when the cache size is 700. In fact, cache sizes below that lead to a performance hit – I assume due to the time required to query the map.

While the performance improvement is not dramatic, it is close to 10% compared to no caching at all. In actuality, the major bottleneck in SMARTS matching is the actual graph isomorphism step (which we hope to drastically improve by using the SMSD code). Maybe then SMARTS parsing will take a bigger fraction of the time. Also keep in mind that, due to the heavyweight nature of CDK molecule objects, very large caches could be a strain on memory. But to check that out, I should use a profiler.

Written by Rajarshi Guha

January 23rd, 2011 at 12:56 am

## Visualizing PAINS SMARTS

without comments

A few days ago I had made available a SMARTS version of the PAINS substructural filters, that were converted using CACTVS from the original SLN patterns. I had mentioned that the SMARTSViewer application was a handy way to visualize the complex SMARTS patterns. Matthias Rarey let me know that his student had converted all the SMARTS to SMARTSViewer depictions and made them available as a PDF. Given the complexity of many of the PAINS patterns, these depictions are a very nice way to get a quick idea of what is supposed to match.

(FWIW, the SMARTS don’t reproduce the matches obtained using the original SLNs – but hopefully the depictions will help anybody who’d like to try and fix the SMARTS.)

Written by Rajarshi Guha

December 2nd, 2010 at 1:18 am

Posted in software, cheminformatics


## Substructure Searching with Hadoop

with 7 comments

My last two posts have described recent attempts at working with Hadoop, a map/reduce framework. As I noted, Hadoop for cheminformatics is quite trivial when working with SMILES files, which are line oriented, but requires a bit more work when dealing with multi-line records such as those in SD files. But now that we have an SDFInputFormat class that handles SD files, we’re ready to throw a variety of tasks at it. In this post I describe a class to perform SMARTS substructure searching using the CDK and Hadoop.

Doing substructure searching in a map/reduce framework is conceptually quite simple. In the map step, the mapper gets a record (in this case a single entry from the SD file) and performs the match using the supplied SMARTS pattern. It emits a key/value pair of the form “molid 1” if the molecule matched the pattern, or “molid 0” if it did not. In either case, molid is some identifier for the given molecule.

The reduce step simply examines each key/value pair it receives and discards those with values equal to 0. The resultant output will contain the IDs (in this case molecule titles, since we’re reading from SD files) of the molecules that matched the supplied pattern.
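Before looking at the Hadoop version, the two steps can be sketched in-memory with a placeholder standing in for the actual SMARTS match (this is an illustration with made-up names, not the CDK or Hadoop code):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// In-memory sketch of the map/reduce flow described above: map emits
// (molid, 1) or (molid, 0) pairs, reduce keeps the IDs that matched.
public class MapReduceSketch {
    // "map": the boolean stands in for the SMARTS match result
    static Map.Entry<String, Integer> map(String molid, boolean matched) {
        return new AbstractMap.SimpleEntry<>(molid, matched ? 1 : 0);
    }

    // "reduce": discard pairs whose value is 0, keep matching IDs
    static List<String> reduce(List<Map.Entry<String, Integer>> pairs) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pairs)
            if (e.getValue() == 1) hits.add(e.getKey());
        return hits;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = Arrays.asList(
                map("mol1", true), map("mol2", false), map("mol3", true));
        System.out.println(reduce(emitted));  // [mol1, mol3]
    }
}
```

Hadoop adds the machinery around this: records are distributed to mappers on different nodes, and pairs with the same key are grouped before reaching the reducer.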

The source for this class is given below:

```java
package net.rguha.dc;

import net.rguha.dc.io.SDFInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.openscience.cdk.CDKConstants;
import org.openscience.cdk.ChemFile;
import org.openscience.cdk.ChemObject;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.io.MDLV2000Reader;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;
import org.openscience.cdk.tools.manipulator.ChemFileManipulator;

import java.io.IOException;
import java.io.StringReader;
import java.util.List;

public class SubSearch {
    static SMARTSQueryTool sqt;
    static {
        try {
            sqt = new SMARTSQueryTool("C");
        } catch (CDKException e) {
        }
    }

    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    public static class MoleculeMapper extends Mapper {
        private Text matches = new Text();
        private String pattern;

        public void setup(Context context) {
            pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            try {
                StringReader sreader = new StringReader(value.toString());
                MDLV2000Reader reader = new MDLV2000Reader(sreader);
                ChemFile chemFile = (ChemFile) reader.read((ChemObject) new ChemFile());
                List containersList = ChemFileManipulator.getAllAtomContainers(chemFile);
                IAtomContainer molecule = containersList.get(0);
                sqt.setSmarts(pattern);
                boolean matched = sqt.matches(molecule);
                matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                if (matched) context.write(matches, one);
                else context.write(matches, zero);
            } catch (CDKException e) {
                e.printStackTrace();
            }
        }
    }

    public static class SMARTSMatchReducer extends Reducer {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable values,
                           Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                if (val.compareTo(one) == 0) {
                    result.set(1);
                    context.write(key, result);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: subsearch ");
            System.exit(2);
        }
        // need to set it before we create the Job object
        conf.set("net.rguha.dc.data.pattern", otherArgs[2]);

        Job job = new Job(conf, "id 1");
        job.setJarByClass(SubSearch.class);
        job.setMapperClass(MoleculeMapper.class);
        job.setCombinerClass(SMARTSMatchReducer.class);
        job.setReducerClass(SMARTSMatchReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(SDFInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map method in MoleculeMapper does the job of performing the SMARTS matching. If the molecule matches, it writes out the molecule title and the value 1. The reduce method in SMARTSMatchReducer simply examines each key/value pair and writes out those keys whose value equals 1.

Another important thing to note is that when we pass in the SMARTS pattern as a command line parameter, it doesn’t automatically become available to the mappers, since they will, in general, run on different nodes than the one on which you started the program. So naively storing a command line argument in a variable in main will result in an NPE when you run the program. Instead, we read in the argument and set it as the value for an (arbitrary) key in the Configuration object. The value can then be accessed via the Context object in the mapper class, wherever the mapper is being run.

We compile this to a jar file, and then run it on a 100-molecule SD file from Pub3D:

```
$ hadoop dfs -copyFromLocal ~/Downloads/small.sdf input.sdf
$ hadoop jar rghadoop.jar input.sdf output.sdf "[R2]"
```

The output is of the form

```
$ hadoop dfs -cat output.sdf/part-r-00000
120059  1
20060138    1
20060139    1
20060140    1
20060141    1
20060146    1
3803680 1
3803685 1
3803687 1
3803694 1
...
```

where each line lists the PubChem CID of the molecules that matched (27 in this case).

### Postscript

While I’ve been working on these examples with relatively small inputs on my own laptop, it’d be useful to test them out with larger datasets on a real multi-node Hadoop cluster. If anybody has such a setup (using Hadoop 0.20.0), I’d love to try these examples out. I’ll provide a single jar file and the large datasets.

Written by Rajarshi Guha

May 4th, 2009 at 9:24 pm

## Manipulating SMARTS Queries

with 3 comments

Yesterday, Andreas Maunz asked a question on the openbabel-discuss list:

… a possibility to combine two distinct smarts patterns. Of course, there is the comma OR operator (s1,s2), but I was thinking of a more sophisticated combination, where equal parts of s1 and s2 are “mapped onto each other” in a way such that merely the exclusive parts of s1 and s2 are connected via OR to the common core structure, where possible?

Fundamentally, a solution requires us to be able to manipulate and transform the SMARTS query in various ways. Operations of these types have been discussed by Roger Sayle in a MUG presentation, and they are available in the OpenEye tools. As far as I know, there are no open source implementations of such SMARTS optimizations (or manipulations). Later on in Andreas’ thread, Andrew Dalke suggested that one could attack this by performing an MCSS calculation on two graphs representing the two queries, along with some appropriate normalizations.

It turns out that this type of manipulation isn’t too difficult in the CDK, thanks to Dazhi Jiao’s work on the JJTree based SMARTS parser. Since the parser generates an abstract syntax tree, the SmartsQueryVisitor is able to walk the tree and generate an IQueryAtomContainer. Basically this is analogous to a molecule (i.e., IAtomContainer), and hence we can apply pre-existing code for substructure searches. Thus, to find the largest common query fragment between two SMARTS queries we can do:

```java
import org.openscience.cdk.smiles.smarts.parser.SMARTSParser;
import org.openscience.cdk.smiles.smarts.parser.ASTStart;
import org.openscience.cdk.smiles.smarts.parser.ParseException;
import org.openscience.cdk.smiles.smarts.parser.visitor.SmartsQueryVisitor;
import org.openscience.cdk.isomorphism.matchers.IQueryAtomContainer;
import org.openscience.cdk.isomorphism.UniversalIsomorphismTester;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IBond;
import org.openscience.cdk.exception.CDKException;

import java.util.List;

public class SMARTSManipulator {

    public SMARTSManipulator() {
    }

    private IQueryAtomContainer getQueryMolecule(String smarts) throws ParseException {
        SMARTSParser parser = new SMARTSParser(new java.io.StringReader(smarts));
        ASTStart ast = parser.Start();
        SmartsQueryVisitor visitor = new SmartsQueryVisitor();
        return (IQueryAtomContainer) visitor.visit(ast, null);
    }

    // helper to convert the IQueryAtomContainer to an IAtomContainer, to make
    // the UIT happy. This is OK, since the atoms and bonds in the
    // IAtomContainer are still the query atoms and query bonds
    private IAtomContainer convertToAtomContainer(IQueryAtomContainer queryContainer) {
        IAtomContainer x = queryContainer.getBuilder().newAtomContainer();
        for (IAtom atom : queryContainer.atoms()) x.addAtom(atom);
        for (IBond bond : queryContainer.bonds()) x.addBond(bond);
        return x;
    }

    public void smartsMCSS() throws ParseException, CDKException {
        IQueryAtomContainer queryTarget = getQueryMolecule("c1ccccc1C(=O)C*");
        IQueryAtomContainer queryQuery = getQueryMolecule("*C([N,S])C=O");

        boolean isSubgraph = UniversalIsomorphismTester.isSubgraph(convertToAtomContainer(queryTarget), queryQuery);
        System.out.println("isSubgraph = " + isSubgraph);
        System.out.println("");

        List mcss = UniversalIsomorphismTester.getOverlaps(
                convertToAtomContainer(queryTarget),
                convertToAtomContainer(queryQuery));
        System.out.println("mcss.size() = " + mcss.size() + "\n");
        for (IAtomContainer aMcs : mcss) {
            System.out.println("aMcs.getAtomCount() = " + aMcs.getAtomCount());
            for (IAtom matchingAtom : aMcs.atoms())
                System.out.println("matchingAtom = " + matchingAtom);
            System.out.println("");
        }
    }

    public static void main(String[] args) throws ParseException, CDKException {
        SMARTSManipulator m = new SMARTSManipulator();
        m.smartsMCSS();
    }
}
```

Note that we convert the IQueryAtomContainer to an IAtomContainer, since the UniversalIsomorphismTester expects that we are performing queries against an actual (i.e., real) molecule.

As expected the smaller query is not a subgraph of the larger query. But when we evaluate the MCSS, we do end up with a query fragment consisting of *CC=O.

Now this solution is just a quick hack to see if it could be done – it will fail on more general queries (such as C[#6]N and CCN) since we don’t perform any normalization. Furthermore, there are more sophisticated optimizations that could be done on the AST directly, but that’s for another night.

Written by Rajarshi Guha

March 7th, 2009 at 7:10 am