Cheminformatics with Hadoop and EC2

In the last few posts I’ve described how I’ve gotten up to speed on developing Map/Reduce applications using the Hadoop framework. The nice thing is that I can set it all up and test it out on my laptop and then easily migrate the application to a large production cluster. Over the past few days I converted my pharmacophore searching code into a Hadoop application. After doing some searches on small collections, it was time to test it out on a real cluster.

Cluster Setup

The cluster I used was Amazon EC2. Before running Hadoop applications, you need to set up your local environment for EC2, following excellent instructions. With this in place, you’ll need to setup a Hadoop cluster. While one can do it using tools from the Hadoop sources, Cloudera provides a very easy set of setup scripts. Instructions are given here. With the Cloudera scripts installed, you can set up a 10 node cluster by doing (from your local machine)

1	hadoop-ec2 launch-cluster rgc 10

where rgc is the name of your cluster. There’ll be some output and it will provide you with the hostname of the master node of your cluster (to which you can ssh and start jobs). You can also visit http://hostname:50030 to track the progress of Hadoop jobs. By default this process will use c1.medium EC2 instances, though you can change this in the set up scripts. Also note that each node will run 2 map tasks – this will be useful later on.

Finally, when you’re done with the cluster remember to terminate it! Otherwise you’re going to rack up bills.

Data Setup

So the cluster is ready to run jobs. But we need data for it. A simple approach is to use scp to copy data files onto the master node and then copy the data files to the HDFS on your EC2 cluster. This not a good idea for any real sized dataset, as you will loose all the data once the cluster is terminated. A better idea is to load the input data in S3. I use S3Fox, a Firefox extension, to load data from my laptop into S3. Once you have the data file in a S3 bucket, you can access it on an EC2 node using the following notation

1	s3n://AWS-ACCESS-ID:AWS-SECRET-ID@bucket/filename

For my particular set up, I obtained 136K structures from Pub3D as a single SD file and uploaded it into an S3 bucket. However, I used scp to copy my Hadoop program jar file and the pharmacophore definition file directly onto the master node, as they were relatively small. I should note that for this run, the 136K structures were only about 560MB – tiny compared to what one would usually use Hadoop for.

Program Setup

While developing the Hadoop program I had started using Hadoop 0.20.0. But Amazon only supports version 0.18.3. So some refactoring was required. The only other thing that I had to modify in my program was to add the statement

1	conf.set("mapred.map.tasks", "20");

to indicate that the application should try and use up to 20 map tasks. While this is usually taken as a suggestion by Hadoop, my experiment indicated that without this it would only run two map tasks (and hence 1 node) rather than say 20 map tasks for 10 nodes. This is due to the way the input file is split – the default is to create 128MB splits, thus requiring about 4 map tasks (since each split goes to a single mapper). By specifying we want 20 map tasks, we can ‘force’ the use of multiple nodes. At this point, I’m not entire clear as to why I need to force it this way. My understanding is that this is not required when dealing with multi-gigabyte input files.

In preparation for the run, I compiled all the classes and created a single jar file containing the CDK as well my own application classes. This avoids having to fiddle with classpaths on the Hadoop cluster. You can get the sources from my GitHib repository (the v18 branch is the one for running on Amazon).

Runs

With our cluster, data and program all set up we can set of a run. With my input data on S3, I logged into my master node on EC2 and the run was started with

1	hadoop jar rghadoop.jar s3n://AWS-ACCESS-ID:AWS-SECRET-ID@pcore/input.sdf output cns2.xml

While this runs you can view the job progress via http://hostname:50030 (hostname being whatever the cluster setup process provided). My initial run used a 4 node cluster and took 6 min 35 sec. However it was simple to terminate this cluster and restart one with 10 nodes. On the new cluster the run time dropped to 3 min 33 sec to process 136K structures. For comparison, running the same command, using 2 map tasks, took about 20 minutes on my MacBook Pro (2.4 GHz, 4GB RAM).

Cost Issues

So how much did this experiment cost? While I don’t have the exact numbers, the actual processing on the 4-node cluster cost $0.80 – four c2.medium instances at $0.20 / hour (since anytime less than an hour is still billed as an hour). Clearly, the 10-node cluster cost $2.00 – but while the result was obtained faster, we could have simply stayed with the 4-node cluster and saved half the price. Of course, the actual price will be a little higher since it took some time to upload the application and start the jobs. Another cost was S3 storage. Currently I’m using less than 1GB and when band width costs are taken into account this is about $0.25. But less than $5 is not too bad. There’s also a handy application to estimate costs associated with various Amazon services.

Conclusions

While this experiment didn’t actually highlight new algorithms or methods, it does highlight the ease with which data intensive computation can be handled. What was really cool for me was that I have access to massive compute resources, accessible with a few simple command line invocations. Another nice thing is that the Hadoop framework, makes handling large data problems pretty much trivial – as opposed to chunking my data by hand, making sure each chunk is processed by a different node and all the scheduling issues associated with this.

The next thing to look at is how one can access the Amazon public datasets stored on EBS from a Hadoop cluster. This will allow pharmacophore searching for the entire PubChem collection – either via the Pub3D dataset (single conformer) or else via the PubChem dataset (multiple conformers). While I’ve focused on pharmacophore searching, one can consider arbitrary cheminformatics tasks.

Going further, one could consider the use of HBase, a column store based on Hadoop, as a storage system for large chemical collections and couple it to Hadoop applications. This will be useful, if the use case does not involve complex relational queries. Going back to pharmacophore searches, one could imagine running searches against large collections stored in HBase, and updating the database with the results – in this case, database usage is essentially simple lookups based on compound ID, as opposed to relational queries.

Finally, it’d also be useful to try and consider cheminformatics applications that could make use of the Map/Reduce framework at an algorithmic level, as opposed to Map/Reduce to simply processe data in chunks. Some immediate applications that come to mind include pharmacophore discovery and diversity analysis.

7 thoughts on “Cheminformatics with Hadoop and EC2”

Tom White says:

May 12, 2009 at 7:47 pm

To access the Amazon public datasets you can simply mount a volume then copy it into HDFS. There are details here: http://www.cloudera.com/blog/2009/05/11/using-clouderas-hadoop-amis-to-process-ebs-datasets-on-ec2/

Cheers,
Tom

Rajarshi Guha says:

May 13, 2009 at 12:59 am

Tom, thanks for the pointer. This will be very useful

When software embraces failure : business|bytes|genes|molecules says:

May 23, 2009 at 6:27 pm

[…] the meantime follow the work Rajarshi is doing with Hadoop and of course work coming from people like Mike […]

Coast to Coast Bio Podcast » Blog Archive » Episode 20: Man, it’s been forever says:

June 15, 2009 at 12:32 am

[…] Cheminformatics with Hadoop and EC2 […]

mat says:

September 9, 2009 at 11:18 am

i can relate to the refactor from 0.20 to 0.19, i had to go through the same thing!

Harshit Kumar says:

October 20, 2009 at 3:17 am

Hi

I cannot access the hostname:50030 on EC2.

Do I need to start Apache server instance also.

Rajarshi Guha says:

October 25, 2009 at 7:07 pm

Harshit, no, you shouldn’t need to start apache or anything like that

So much to do, so little time

Trying to squeeze sense out of chemical data