Pig and Cheminformatics

Pig is a platform for analyzing large datasets. At its core is a high level language (called Pig Latin), that is focused on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure either in local or map/reduce modes (the latter making use of Hadoop). Previously I had […]

Cheminformatics with Hadoop and EC2

In the last few posts I’ve described how I’ve gotten up to speed on developing Map/Reduce applications using the Hadoop framework. The nice thing is that I can set it all up and test it out on my laptop and then easily migrate the application to a large production cluster. Over the past few days […]

Hadoop, Chunks and Multi-line Records

In a previous post I described how one requires a custom RecordReader class to deal with multi-line records (such as SD files) in a Hadoop program. While it worked fine on a small input file (less than 5MB) I had not addressed the issue of “chunking” and that caused it to fail when dealing with […]

Substructure Searching with Hadoop

My last two posts have described recent attempts at working with Hadoop, a map/reduce framework. As I noted, Hadoop for cheminformatics is quite trivial when working with SMILES files, which is line oriented but requires a bit more work when dealing with multi-line records such as in SD files. But now that we have a […]

Hadoop and SD Files

In my previous post I had described my initial attempts at working with Hadoop, an implementation of the map/reduce framework. Most Hadoop examples are based on line oriented input files. In the cheminformatics domain, SMILES files are line oriented and so applying Hadoop to a variety of tasks that work with SMILES input is easy. […]