Substructure Searching with Hadoop

My last two posts have described recent attempts at working with Hadoop, a map/reduce framework. As I noted, cheminformatics with Hadoop is quite trivial when working with SMILES files, which are line oriented, but requires a bit more work when dealing with multi-line records such as those in SD files. Now that we have an SDFInputFormat class that handles SD files, we’re ready to throw a variety of tasks at it. In this post I describe a class that performs SMARTS substructure searching using the CDK and Hadoop.

Doing substructure searching in a map/reduce framework is conceptually quite simple. In the map step, the mapper gets a record (in this case a single entry from the SD file) and performs the match against the supplied SMARTS pattern. It emits a key/value pair of the form “molid 1” if the molecule matched the pattern, or “molid 0” if it did not. In either case, molid is some identifier for the given molecule.

The reduce step simply examines each key/value pair it receives and discards those with values equal to 0. The resultant output contains the IDs (in this case molecule titles, since we’re reading from SD files) of the molecules that matched the supplied pattern.

The source for this class is given below:

package net.rguha.dc;

import net.rguha.dc.io.SDFInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.openscience.cdk.CDKConstants;
import org.openscience.cdk.ChemFile;
import org.openscience.cdk.ChemObject;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.io.MDLV2000Reader;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;
import org.openscience.cdk.tools.manipulator.ChemFileManipulator;

import java.io.IOException;
import java.io.StringReader;
import java.util.List;

public class SubSearch {
    // initialized with a placeholder SMARTS; the real pattern is set in the mapper
    static SMARTSQueryTool sqt;
    static {
        try {
            sqt = new SMARTSQueryTool("C");
        } catch (CDKException e) {
            e.printStackTrace();
        }
    }

    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    public static class MoleculeMapper extends Mapper<Object, Text, Text, IntWritable> {

        private Text matches = new Text();
        private String pattern;

        // pull out the SMARTS pattern that main() stored in the Configuration
        public void setup(Context context) {
            pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            try {
                // the value handed to us by SDFInputFormat is a single SD file entry
                StringReader sreader = new StringReader(value.toString());
                MDLV2000Reader reader = new MDLV2000Reader(sreader);
                ChemFile chemFile = (ChemFile) reader.read((ChemObject) new ChemFile());
                List<IAtomContainer> containersList = ChemFileManipulator.getAllAtomContainers(chemFile);
                IAtomContainer molecule = containersList.get(0);

                sqt.setSmarts(pattern);
                boolean matched = sqt.matches(molecule);
                matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                if (matched) context.write(matches, one);
                else context.write(matches, zero);
            } catch (CDKException e) {
                e.printStackTrace();
            }
        }
    }

    public static class SMARTSMatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                if (val.compareTo(one) == 0) {
                    result.set(1);
                    context.write(key, result);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: subsearch <in> <out> <pattern>");
            System.exit(2);
        }

        // need to set it before we create the Job object
        conf.set("net.rguha.dc.data.pattern", otherArgs[2]);
       
        Job job = new Job(conf, "id 1");
        job.setJarByClass(SubSearch.class);
        job.setMapperClass(MoleculeMapper.class);
        job.setCombinerClass(SMARTSMatchReducer.class);
        job.setReducerClass(SMARTSMatchReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(SDFInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map method in MoleculeMapper does the job of performing the SMARTS matching. If the molecule matches, it writes out the molecule title and the value 1; otherwise it writes the title and 0. The reduce method in SMARTSMatchReducer simply examines each key/value pair and writes out those keys whose value equals 1.
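
Since the chemistry is independent of Hadoop, the matching logic in the mapper can also be exercised on its own. The snippet below is a minimal sketch (not part of the original post) that parses a molecule from a SMILES string and applies the same SMARTSQueryTool calls used in the map method; the class name and the example SMILES/SMARTS strings are made up for illustration, and it assumes the same CDK version used above.

import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;

public class MatchDemo {
    public static void main(String[] args) throws Exception {
        // parse a molecule from SMILES rather than from an SD entry
        SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
        IAtomContainer mol = sp.parseSmiles("c1ccccc1C(=O)N");

        // the same two calls the mapper makes: set the pattern, then test the molecule
        SMARTSQueryTool sqt = new SMARTSQueryTool("C(=O)N");
        boolean matched = sqt.matches(mol);
        System.out.println("matched = " + matched);
    }
}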

Another important thing to note is that when we pass in the SMARTS pattern as a command line parameter, it doesn’t automatically become available to the mappers, since they will, in general, run on different nodes than the one on which you started the program. So naively storing a command line argument in a variable in main will result in an NPE when you run the program. Instead, we read in the argument and set it as the value of an (arbitrary) key in the Configuration object (the conf.set() call in main). The value can then be retrieved via the Context object in the mapper class (the setup method), wherever the mapper happens to run.
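
Stripped of the chemistry, the idiom looks like the sketch below. This is not from the original post; the class name and the configuration key are hypothetical, but the Configuration and Mapper calls are the standard Hadoop 0.20 API. The point is simply that the value is set on the Configuration before the Job is created, and read back in the mapper’s setup() method.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigPassingDemo {

    public static class DemoMapper extends Mapper<Object, Text, Text, IntWritable> {
        private String pattern;

        // setup() runs once per task, on whichever node the task was scheduled
        @Override
        protected void setup(Context context) {
            pattern = context.getConfiguration().get("demo.pattern");
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // store the command line argument before the Job is created,
        // so it is shipped to every task as part of the job configuration
        conf.set("demo.pattern", args[0]);

        Job job = new Job(conf, "config passing demo");
        job.setJarByClass(ConfigPassingDemo.class);
        job.setMapperClass(DemoMapper.class);
        // input/output paths and formats would be set here, as in SubSearch
    }
}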

We compile this to a jar file, and then run it on a 100-molecule SD file from Pub3D:

$ hadoop dfs -copyFromLocal ~/Downloads/small.sdf input.sdf
$ hadoop jar rghadoop.jar input.sdf output.sdf "[R2]"

The output is of the form

$ hadoop dfs -cat output.sdf/part-r-00000
120059  1
20060138    1
20060139    1
20060140    1
20060141    1
20060146    1
3803680 1
3803685 1
3803687 1
3803694 1
...

where each line lists the PubChem CID of a molecule that matched the pattern (27 matches in this case).

Postscript

While I’ve been working on these examples with relatively small inputs on my own laptop, it’d be useful to test them out with larger datasets on a real multi-node Hadoop cluster. If anybody has such a setup (using 0.20.0 of Hadoop), I’d love to try these examples out. I’ll provide a single jar file and the large datasets.

7 thoughts on “Substructure Searching with Hadoop”

  1. […] Substructure Searching with Hadoop at So much to do, so little time […]

  2. […] Substructure Searching with Hadoop at So much to do, so little time […]

  3. Rajarshi,

    If you want to test this on a larger cluster, I think Pub3d is available as a public dataset on Amazon:

    http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2284&categoryID=247

    You can launch a larger 0.20 Hadoop cluster on EC2 for a couple of dollars using the bash scripts bundled with Hadoop, then mount the public dataset and load it into HDFS.

    -Pete

  4. Pete, thanks for the pointer. Yes, I was planning to use Pub3D, though I wasn’t sure whether EC2 had support for 0.20.0 (the Cloudera AMI seems to use 0.18.0).

  5. […] the last few posts I’ve described how I’ve gotten up to speed on developing Map/Reduce applications using […]

  6. […] the meantime follow the work Rajarshi is doing with Hadoop and of course work coming from people like Mike […]

  7. […] Previously I had investigated Hadoop for running cheminformatics tasks such as SMARTS matching and pharmacophore searching. While the implementation of such code is pretty straightforward, it’s still pretty heavyweight compared to, say, performing SMARTS matching in a database via SQL. On the other hand, being able to perform these tasks in Pig Latin lets us write much simpler code that can be integrated with other non-cheminformatics code in a flexible manner. An example of a Pig Latin script that we might want to execute is: A = load 'medium.smi' as (smiles:chararray); B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N'); store B into 'output.txt'; […]
