Pig and Cheminformatics

Pig is a platform for analyzing large datasets. At its core is a high-level language, Pig Latin, that focuses on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure in either local or map/reduce mode (the latter making use of Hadoop).

Previously I had investigated Hadoop for running cheminformatics tasks such as SMARTS matching and pharmacophore searching. While implementing such code is pretty straightforward, it’s still pretty heavyweight compared to, say, performing SMARTS matching in a database via SQL. On the other hand, being able to perform these tasks in Pig Latin lets us write much simpler code that can be integrated with other non-cheminformatics code in a flexible manner. An example of a Pig Latin script that we might want to execute is:

A = load 'medium.smi' as (smiles:chararray);
B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

The script loads a file containing SMILES strings, keeps the entries that match the specified SMARTS pattern, and writes the matching SMILES to an output file. Clearly, this is very similar to SQL. However, the above won’t work on a default Pig installation since SMATCH is not a builtin function. Instead, we need to write a user-defined function (UDF).

UDFs are implemented in Java and can be classified into one of three types: eval, aggregate or filter functions. For this example I’ll consider a filter function that takes two strings, a SMILES string and a SMARTS string, and returns true if the SMILES contains the specified pattern.

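import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;

public class SMATCH extends FilterFunc {
    // Shared by all instances, so the matcher and parser are built once per JVM.
    static SMARTSQueryTool sqt;
    static {
        try {
            sqt = new SMARTSQueryTool("C"); // placeholder pattern, replaced in exec
        } catch (CDKException e) {
            System.out.println(e);
        }
    }
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    // Expects (SMILES, SMARTS) and returns true if the molecule matches the pattern.
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            sqt.setSmarts(query);
            IAtomContainer mol = sp.parseSmiles(target);
            return sqt.matches(mol);
        } catch (CDKException e) {
            throw WrappedIOException.wrap("Error in SMARTS pattern " + query + " or SMILES string " + target, e);
        }
    }
}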

A UDF for filtering must extend the FilterFunc class and implement a single method, exec. Within this method, we check whether we have the requisite number of input arguments and, if so, simply return the result of the SMARTS match. For more details on filter functions see the UDF manual.

One of the key features of the code is the static initialization of the SMILES parser and SMARTS matcher. I’m not entirely sure how many times the UDF is instantiated during a query (once for each “row”? Once for the entire query?), but if it’s more than once, we don’t want to instantiate the parser and matcher in the exec function. Note that since Hadoop runs each task in a single thread by default, we don’t need to worry about the lack of thread safety in the CDK.
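To illustrate why the static initialization helps regardless of how often the UDF is instantiated, here is a minimal sketch in plain Java. It does not use Pig or the CDK: the class SharedHelperDemo, its helperBuilds counter and the buildHelper method are all hypothetical names made up for this example, and a compiled regular expression stands in for the SMARTS matcher. It does plain text matching, not chemical substructure matching.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical stand-in for a filter UDF; no Pig or CDK classes are involved.
class SharedHelperDemo {
    // Counts how many times the "expensive" helper gets built.
    static final AtomicInteger helperBuilds = new AtomicInteger();

    // Stand-in for the SMARTS matcher: built once, when the class is loaded.
    static final Pattern HELPER = buildHelper();

    static Pattern buildHelper() {
        helperBuilds.incrementAndGet();
        // Literal text pattern "C(=O)N"; an assumption for illustration only.
        return Pattern.compile("C\\(=O\\)N");
    }

    // Same shape as exec(): per-call work only, no helper construction.
    boolean exec(String target) {
        return target != null && HELPER.matcher(target).find();
    }

    public static void main(String[] args) {
        String[] rows = {"NC(=O)C(=O)N", "CCO", "CC(=O)NC"};
        int matches = 0;
        for (String row : rows) {
            // Worst case: a fresh instance is created for every row.
            if (new SharedHelperDemo().exec(row)) matches++;
        }
        System.out.println("matches=" + matches + " helperBuilds=" + helperBuilds.get());
    }
}
```

Even in the worst case of one instance per row, the helper is built only once per JVM, which is the behaviour the static fields in SMATCH rely on.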

Compiling the above class and packaging it into a jar file allows us to run the Pig Latin script from the command line (you’ll have to register the jar file at the beginning of the script by writing register /path/to/myudf.jar):

$ pig-0.4.0/bin/pig -x local match.pig # runs in local mode
2010-09-09 20:37:00,107 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/rguha/src/java/hadoop-0.18.3/pig_1284079020107.log
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/Users/rguha/src/java/hadoop-0.18.3/output.txt"
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 9
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

As with my previous Hadoop code, the above UDF can be deployed anywhere that Hadoop and HDFS are installed, such as Amazon. The code in this post (and for other Pig UDFs) is available from my repository (in the v18 branch) and is based on Pig 0.4.0, which is pretty old but is required to work with Hadoop 0.18.3.
