My last two posts have described recent attempts at working with Hadoop, a map/reduce framework. As I noted, using Hadoop for cheminformatics is quite trivial when working with SMILES files, which are line oriented, but requires a bit more work when dealing with multi-line records such as those in SD files. Now that we have an SDFInputFormat class that handles SD files, we're ready to throw a variety of tasks at it. In this post I describe a class to perform SMARTS substructure searching using the CDK and Hadoop.

Doing substructure searching in a map/reduce framework is conceptually quite simple. In the map step, the mapper gets a record (in this case a single entry from the SD file) and performs the match using the supplied SMARTS pattern. It emits a key/value pair of the form "molid 1" if the molecule matches the pattern, and "molid 0" if it does not. In either case, molid is some identifier for the given molecule.

The reduce step simply examines each key/value pair it receives and discards those with values equal to 0. The resultant output will contain the IDs (in this case molecule titles, since we're reading from SD files) of the molecules that matched the supplied pattern.

The source for this class is given below.

The map method in MoleculeMapper does the job of performing the SMARTS matching. If the molecule matches, it writes out the molecule title and the value 1. The reduce method in SMARTSMatchReducer simply examines each key/value pair and writes out those keys whose value equals 1.
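For readers without a Hadoop setup handy, the map and reduce logic described above can be simulated in plain Java. This is a minimal sketch under stated assumptions, not the post's actual Mapper/Reducer classes: the class and method names here are invented, and the SMARTS match is replaced by a trivial substring test standing in for CDK's SMARTS matching.

```java
import java.util.*;

public class MatchFlow {
    // Toy stand-in for a real SMARTS match; in the actual job this would
    // delegate to the CDK. Here we just test for a substring.
    static boolean matches(String smiles, String pattern) {
        return smiles.contains(pattern);
    }

    // Map step: emit one (molid, 0/1) pair per molecule, 1 if it matched.
    static Map<String, Integer> map(Map<String, String> molecules, String pattern) {
        Map<String, Integer> emitted = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : molecules.entrySet())
            emitted.put(e.getKey(), matches(e.getValue(), pattern) ? 1 : 0);
        return emitted;
    }

    // Reduce step: discard pairs whose value is 0, keep the matching ids.
    static List<String> reduce(Map<String, Integer> pairs) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pairs.entrySet())
            if (e.getValue() == 1) hits.add(e.getKey());
        return hits;
    }
}
```

The point of the exercise is that the two steps are completely independent: Hadoop is free to run many mappers in parallel, and the reducer only ever sees (id, value) pairs.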

Another important thing to note is that when we pass in the SMARTS pattern as a command line parameter, it doesn't automatically become available to the mappers, since they will, in general, be run on different nodes than the one on which you started the program. So naively storing a command line argument in a variable in main will result in an NPE when you run the program. Instead, we read in the argument and set it as the value for an (arbitrary) key in the Configuration object (line 90). The object can then be accessed via the Context object in the mapper class (lines 43-45), wherever the mapper is being run.
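The mechanics of this can be sketched without Hadoop by treating the job configuration as a plain key/value map. This is an illustrative analogue, not the post's actual code: the key name and method names below are made up; in the real program the equivalent calls are Configuration.set() in main and context.getConfiguration().get() in the mapper's setup.

```java
import java.util.HashMap;
import java.util.Map;

public class SmartsConfig {
    // Hypothetical key name; any (arbitrary) string key works the same way.
    static final String KEY = "smarts.pattern";

    // Plays the role of main(): store the CLI argument in the job
    // configuration rather than in a local/static variable, because the
    // configuration is serialized and shipped to every node with the job.
    static Map<String, String> buildConf(String smartsArg) {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, smartsArg);
        return conf;
    }

    // Plays the role of the mapper reading the pattern back out of the
    // configuration, on whichever node it happens to run.
    static String readPattern(Map<String, String> conf) {
        return conf.get(KEY);
    }
}
```

A static variable set in main would only exist in the driver JVM, which is why the configuration object is the right channel for small per-job parameters like a SMARTS string.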

We compile this to a jar file, and then run it on a 100-molecule SD file from Pub3D:

```
$ hadoop dfs -copyFromLocal ~/Downloads/small.sdf input.sdf
$ hadoop jar rghadoop.jar input.sdf output.sdf "[R2]"
```

The output is of the form

```
$ hadoop dfs -cat output.sdf/part-r-00000
120059	1
20060138	1
20060139	1
20060140	1
20060141	1
20060146	1
3803680	1
3803685	1
3803687	1
3803694	1
...
```

where each line lists the PubChem CID of a molecule that matched (27 in this case).

### Postscript

While I’ve been working on these examples with relatively small inputs on my own laptop, it’d be useful to test them out with larger datasets on a real multi-node Hadoop cluster. If anybody has such a setup (using Hadoop 0.20.0), I’d love to try these examples out. I’ll provide a single jar file and the large datasets.

## 7 thoughts on “Substructure Searching with Hadoop”

1. […] Substructure Searching with Hadoop at So much to do, so little time […]


3. Rajarshi,

If you want to test this on a larger cluster, I think Pub3d is available as a public dataset on Amazon:

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2284&categoryID=247

You can launch a larger 0.20 Hadoop cluster on EC2 for a couple of dollars using the bash scripts bundled with Hadoop, then mount the public dataset and load it into HDFS.

-Pete

4. Pete, thanks for the pointer. Yes, I was planning to use Pub3D, though I wasn’t sure whether EC2 had support for 0.20.0 (the Cloudera AMI seems to use 0.18.0).

5. […] the last few posts I’ve described how I’ve gotten up to speed on developing Map/Reduce applications using […]

6. […] the meantime follow the work Rajarshi is doing with Hadoop and of course work coming from people like Mike […]

7. […] Previously I had investigated Hadoop for running cheminformatics tasks such as SMARTS matching and pharmacophore searching. While the implementation of such code is pretty straightforward, it’s still pretty heavyweight compared to, say, performing SMARTS matching in a database via SQL. On the other hand, being able to perform these tasks in Pig Latin lets us write much simpler code that can be integrated with other non-cheminformatics code in a flexible manner. An example of a Pig Latin script that we might want to execute is:

       A = load 'medium.smi' as (smiles:chararray);
       B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
       store B into 'output.txt';

   […]