So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘lisp’ tag

Simple XML Parsing with Clojure

with 2 comments

A while back I had started playing with Clojure. It’s always been a spare-time hobby and not having had much spare time I haven’t really gotten as far ahead with it as I’d have liked. I’m still not sure why I like Clojure, but it is fun to code in. My interest was revitalized when I came across a Clojure group located in the D.C. area. So following on my previous post on geo-referencing PubMed articles, I decided to take a stab at doing the whole thing in Clojure.

One of the tasks in this project is to query PubMed using the EUtils CGIs and parse out the information from the XML document that is returned. It turns out that parsing XML documents or strings is very easy in Clojure.¬† The parse method in the clojure.xml namespace supports parsing of XML documents, returning a tree of tags. Using xml-zipper from the clojure.zip namespace creates a zipper data structure from the tree. Extracting specific elements is achieved by filtering the zipper by the path to the desired element. It’s a lot like the ElementTree module in Python (but doesn’t require that I insert namespaces before each and every element in the path!). We start of by working in our own namespace and then importing the relevant packages

1
2
3
4
(ns entrez
  (:require [clojure.xml :as xml])
  (:require [clojure.zip :as zip])
  (:require [clojure.contrib.zip-filter.xml :as zf]))

Next we define some helper methods

1
2
3
4
5
6
7
8
9
(defn get-ids [zipper]
  "Extract specific elements from an XML document"
  (zf/xml-> zipper :IdList :Id zf/text))

(defn get-affiliations [zipper]
  "Extract affiliations from PubMed abstracts"
  (map (fn [x y] (list x y))
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :PMID zf/text)
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :Article :Affiliation zf/text)))

Finally, we can get the ID’s from an esearch query by saving the results to a file and then running

1
2
3
(println (get-ids
      (zip/xml-zip
       (xml/parse "esearch.xml"))))

or extract affiliations from a set of PubMed abstracts obtained via an efetch query

1
2
3
(println (get-affiliations
      (zip/xml-zip
       (xml/parse "efetch.xml"))))

In the next post I’ll show some code to actually perform the queries via EUtils so that we don’t need to save results to files.

Written by Rajarshi Guha

February 17th, 2010 at 3:30 am

Posted in software,Uncategorized

Tagged with , , ,

Easy Parallelization With Clojure

with 6 comments

As I noted in my previous post, one of the nice features of Clojure is its support for concurrent programming. Now, it provides some fancy features that allow one to write complex parallel programs. I’m certainly no expert on that topic. However, one thing that I do everyday is perform operations on elements of a list. Traditionally, this is a serial operation. But what’d be nice is to have my compiler (or environment) perform this operation in parallel over the elements of the list. Clojure provides a very simple way to do this – pmap.

The map form simply applies a function to the elements of a list (or corresponding elements of multiple lists) in order, returning a list. By prepending the “p”, this is done in parallel making use of as many cores as are present on your system (but see below). Given the ease of this operation, lets see what we can do with pmap and the CDK.

Based on the previous post, lets calculate fingerprints in a serial fashion followed by the parallel version and see how long it takes. For completeness, I’ll repeat the code from the previous post. First import our packages and set up some basic objects and functions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint MACCSFingerprinter))
(import '(org.openscience.cdk.fingerprint Fingerprinter))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))
;; so we can read lines from a file
(use 'clojure.contrib.duck-streams)

(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))
(def fprinter (new ExtendedFingerprinter))

(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

Next, we load the 4,688 molecules from the data file I described previously. In contrast to before, this code is slightly shorter (thanks to Nik) but also uses the doall form. This forces evaluation of the list, so that the molecules are all loaded into memory. In the timing code we again use it, since if we don’t, we just get the time for the list creation step (which is “instantaneous” due to lazy evaluation), rather than the actual list evaluation.

1
2
(def mols (doall (map #(getmol (. % trim))
              (read-lines "junk.smi"))))

Now, we can evaluate the fingerprints and time the operation. Initially I performed these calculations on my Macbook Pro with 2GB RAM and a dual core CPU.

1
(time (def fpserial (doall (map getfp mols))))

This run took 38.8 s (averaged over three runs). Next we consider the parallel version

1
(time (def fpparallel (doall (pmap getfp mols))))

This version has an average run time of 23.4 s – a 1.6x speedup. Now, it’s not exactly a two-fold speedup. Part of the reason is that there is some overhead for the threads. Also, even in the serial version, the garbage collector takes up some of the second core and in the parallel version, this will contend with the actual calculation.

Just to be sure that the calculation works OK, lets compare (via BitSet.equals) the fingerprints obtained using the two versions. We expect the result of the code below to be 0

1
2
3
4
5
(count (filter #(if (not %) %)
           (map
        (fn [x,y]
            (. x (equals y)))
        fpserial fpparallel)))

and that’s exactly what we get.

What about using more cores? I have access to some dual-CPU machines with 8GB of RAM, each CPU having four cores. Repeating the above calculations, the serial version takes 28.1 s and the parallel version takes 8.6 s, a 3.2x speedup. One thing I noted was that this really only uses the cores on one CPU, rather than all eight cores.

One thing that will require more investigation is to what extent we can make use of the CDK in parallel environments, since the library was not designed with thread-safety in mind. For example, parsing the SMILES strings using pmap (after reading in all the lines from the file) gives me an ExecutionException error.

In any case, it’s very cool that I can use multiple cores just converting map to pmap.

Written by Rajarshi Guha

March 6th, 2009 at 5:15 pm

Posted in cheminformatics,software

Tagged with , , ,

Testing CDK Fingerprints with Clojure

with 10 comments

Sometime back I was bored and thought that learning Lisp would be a good past time (maybe I should get a life?). I started of with SBCL and then discovered Clojure, a Lisp dialect that compiles to the JVM. The nice thing about this is that it allows one to write Lisp but also interact quite easily with Java libraries. As a result, I can get down to writing code without having to reimplement a bunch of stuff (such as cheminformatics methods for example).

There’s been quite a buzz about this implementation in the Lisp community. It so happens there have recently been a blog post and FF threads about the possible use of Clojure in bioinformatics. As Bosco pointed out, one reason for Clojures’ attractiveness is that¬† it has good support for concurrent programming – but I haven’t gotten that far yet.

So over the last few months I’ve been playing on and off with Clojure and the CDK – mostly small stuff to get the feel for Lisp. Today I came across a blog post indicating that the CDK fingerprints weren’t doing very well at identifying substructures.

C(C)(C)OCC(C)=C

C(C)(C)OCC(C)=C

As noted in the post, the EState and MACCS don’t do very well which is not too surprising. However the post seemed to imply that the hashed fingerprints weren’t working well.

This seemed like an opportunity to practice my Lisp skills. I downloaded the small molecule dataset from DrugBank and converted it to SMILES with OpenBabel. I did some quick filtering to remove a few molecules, resulting in 4,688 SMILES. For this test I just considered one of the query structures from the original post, shown alongside.

So the first step is to import some useful Java classes and Clojure packages

1
2
3
4
5
6
7
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))

;; so we can easily read lines from a file
(use 'clojure.contrib.duck-streams)

We then setup some of the objects that we’ll use such as the SMILES parser and fingerprinter

1
2
3
4
(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))

;; Use the default settings for the fingerprinter
(def fprinter (new ExtendedFingerprinter))

Finally, we create some simple helper methods

1
2
(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

OK, so now we’re ready to play. The first step is to create a fingerprint for the query molecule

1
2
(def querySmiles "C(C)(C)OCC(C)=C")
(def queryfp (getfp (getmol querySmiles)))

Next, we load the molecules by reading each line from the SMILES file and parsing the string into a molecule.

1
2
3
4
(def mols
     (map (fn [x]
        (getmol (. x trim)))
      (read-lines "junk.smi")))

The nice thing about Clojure is that lists are lazy – so if you execute this statement it returns immediately. The molecules will only be parsed when you start accessing the list and even then, it only parses those molecules that are actually accessed. So in other words the following code will give you the first molecule in the list, but all the rest will still remain “unparsed”

1
(first mols)

So given our molecule list, we first want to check whether the query molecule is a substructure using subgraph isomorphism. We can do this using the SMARTSQueryTool as follows

1
2
3
4
5
(def sqt (new SMARTSQueryTool querySmiles))
(def matches
     (map (fn [x]
        (. sqt (matches x)))
      mols))

The result is that matches is a (lazy) list of boolean values. We’d like to see how many of these values are true

1
(count (filter (fn [x] (if x x)) matches))

This gives us 7 (which also matches the result of obgrep). So clearly our query exists as a substructure in this target collection. How do the fingerprints perform?

First we make a helper function to check whether the query fingerprint is contained within the target fingerprint. This is performed by and’ing the two fingerprints and checking whether the result equals the query fingerprint.

1
2
3
(defn issubstruct? [query,target]
  (let [x (. target (and query))]
    (. target (equals query))))

The let form in this function is key – the and method of Java’s BitSet class modifies the object rather than returning the resultant BitSet. This is contrary to the functional paradigm (since a truly functional language does not modify objects). As a result, the return value of (. target (and query)) is nil, and if you then use this in equals, you get a NullPointerException. The let form simply assigns this nil value to x, which we then forget about. But we’re interested in the side-effect – the variable target is now the result of target AND query.

Using this function we can then easily check whether the query fingerprint is contained within the target fingerprints by generating the target fingerprints on the fly and calling issubstruct?

1
2
3
4
5
(def fpmatches
     (map (fn [x]
        (issubstruct?
         queryfp (getfp x)))
      mols))

As before, we want to identify how many values are true

1
(count (filter (fn [x] (if x x)) fpmatches))

which gives us 83.

The fact that we get more than 7 hits using this method is OK since we’re using hashed fingerprints. So it seems that the CDK fingerprints (at least the extended fingerprinter) are working OK. On looking at the posted results however, some queries are not returning enough fingerprint matches. To properly answer this issue we’ll need to look at the original data.

But as a quick example of using Clojure for some reasonable task, this turned out pretty nicely. While I probably won’t be switching to Clojure for full time work, it is a fun platform to play with. (Also, there may be a possibility that R might be going to a more Lisp like form in the future, so this is good practice anyway!)

Written by Rajarshi Guha

March 6th, 2009 at 1:32 am