Easy Parallelization With Clojure

As I noted in my previous post, one of the nice features of Clojure is its support for concurrent programming. Now, it provides some fancy features that allow one to write complex parallel programs. I’m certainly no expert on that topic. However, one thing that I do everyday is perform operations on elements of a list. Traditionally, this is a serial operation. But what’d be nice is to have my compiler (or environment) perform this operation in parallel over the elements of the list. Clojure provides a very simple way to do this – pmap.

The map form simply applies a function to the elements of a list (or corresponding elements of multiple lists) in order, returning a list. By prepending the “p”, this is done in parallel making use of as many cores as are present on your system (but see below). Given the ease of this operation, lets see what we can do with pmap and the CDK.

Based on the previous post, lets calculate fingerprints in a serial fashion followed by the parallel version and see how long it takes. For completeness, I’ll repeat the code from the previous post. First import our packages and set up some basic objects and functions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint MACCSFingerprinter))
(import '(org.openscience.cdk.fingerprint Fingerprinter))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))
;; so we can read lines from a file
(use 'clojure.contrib.duck-streams)

(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))
(def fprinter (new ExtendedFingerprinter))

(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

Next, we load the 4,688 molecules from the data file I described previously. In contrast to before, this code is slightly shorter (thanks to Nik) but also uses the doall form. This forces evaluation of the list, so that the molecules are all loaded into memory. In the timing code we again use it, since if we don’t, we just get the time for the list creation step (which is “instantaneous” due to lazy evaluation), rather than the actual list evaluation.

1
2
(def mols (doall (map #(getmol (. % trim))
              (read-lines "junk.smi"))))

Now, we can evaluate the fingerprints and time the operation. Initially I performed these calculations on my Macbook Pro with 2GB RAM and a dual core CPU.

1
(time (def fpserial (doall (map getfp mols))))

This run took 38.8 s (averaged over three runs). Next we consider the parallel version

1
(time (def fpparallel (doall (pmap getfp mols))))

This version has an average run time of 23.4 s – a 1.6x speedup. Now, it’s not exactly a two-fold speedup. Part of the reason is that there is some overhead for the threads. Also, even in the serial version, the garbage collector takes up some of the second core and in the parallel version, this will contend with the actual calculation.

Just to be sure that the calculation works OK, lets compare (via BitSet.equals) the fingerprints obtained using the two versions. We expect the result of the code below to be 0

1
2
3
4
5
(count (filter #(if (not %) %)
           (map
        (fn [x,y]
            (. x (equals y)))
        fpserial fpparallel)))

and that’s exactly what we get.

What about using more cores? I have access to some dual-CPU machines with 8GB of RAM, each CPU having four cores. Repeating the above calculations, the serial version takes 28.1 s and the parallel version takes 8.6 s, a 3.2x speedup. One thing I noted was that this really only uses the cores on one CPU, rather than all eight cores.

One thing that will require more investigation is to what extent we can make use of the CDK in parallel environments, since the library was not designed with thread-safety in mind. For example, parsing the SMILES strings using pmap (after reading in all the lines from the file) gives me an ExecutionException error.

In any case, it’s very cool that I can use multiple cores just converting map to pmap.

Testing CDK Fingerprints with Clojure

Sometime back I was bored and thought that learning Lisp would be a good past time (maybe I should get a life?). I started of with SBCL and then discovered Clojure, a Lisp dialect that compiles to the JVM. The nice thing about this is that it allows one to write Lisp but also interact quite easily with Java libraries. As a result, I can get down to writing code without having to reimplement a bunch of stuff (such as cheminformatics methods for example).

There’s been quite a buzz about this implementation in the Lisp community. It so happens there have recently been a blog post and FF threads about the possible use of Clojure in bioinformatics. As Bosco pointed out, one reason for Clojures’ attractiveness is that  it has good support for concurrent programming – but I haven’t gotten that far yet.

So over the last few months I’ve been playing on and off with Clojure and the CDK – mostly small stuff to get the feel for Lisp. Today I came across a blog post indicating that the CDK fingerprints weren’t doing very well at identifying substructures.

C(C)(C)OCC(C)=C

C(C)(C)OCC(C)=C

As noted in the post, the EState and MACCS don’t do very well which is not too surprising. However the post seemed to imply that the hashed fingerprints weren’t working well.

This seemed like an opportunity to practice my Lisp skills. I downloaded the small molecule dataset from DrugBank and converted it to SMILES with OpenBabel. I did some quick filtering to remove a few molecules, resulting in 4,688 SMILES. For this test I just considered one of the query structures from the original post, shown alongside.

So the first step is to import some useful Java classes and Clojure packages

1
2
3
4
5
6
7
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))

;; so we can easily read lines from a file
(use 'clojure.contrib.duck-streams)

We then setup some of the objects that we’ll use such as the SMILES parser and fingerprinter

1
2
3
4
(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))

;; Use the default settings for the fingerprinter
(def fprinter (new ExtendedFingerprinter))

Finally, we create some simple helper methods

1
2
(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

OK, so now we’re ready to play. The first step is to create a fingerprint for the query molecule

1
2
(def querySmiles "C(C)(C)OCC(C)=C")
(def queryfp (getfp (getmol querySmiles)))

Next, we load the molecules by reading each line from the SMILES file and parsing the string into a molecule.

1
2
3
4
(def mols
     (map (fn [x]
        (getmol (. x trim)))
      (read-lines "junk.smi")))

The nice thing about Clojure is that lists are lazy – so if you execute this statement it returns immediately. The molecules will only be parsed when you start accessing the list and even then, it only parses those molecules that are actually accessed. So in other words the following code will give you the first molecule in the list, but all the rest will still remain “unparsed”

1
(first mols)

So given our molecule list, we first want to check whether the query molecule is a substructure using subgraph isomorphism. We can do this using the SMARTSQueryTool as follows

1
2
3
4
5
(def sqt (new SMARTSQueryTool querySmiles))
(def matches
     (map (fn [x]
        (. sqt (matches x)))
      mols))

The result is that matches is a (lazy) list of boolean values. We’d like to see how many of these values are true

1
(count (filter (fn [x] (if x x)) matches))

This gives us 7 (which also matches the result of obgrep). So clearly our query exists as a substructure in this target collection. How do the fingerprints perform?

First we make a helper function to check whether the query fingerprint is contained within the target fingerprint. This is performed by and’ing the two fingerprints and checking whether the result equals the query fingerprint.

1
2
3
(defn issubstruct? [query,target]
  (let [x (. target (and query))]
    (. target (equals query))))

The let form in this function is key – the and method of Java’s BitSet class modifies the object rather than returning the resultant BitSet. This is contrary to the functional paradigm (since a truly functional language does not modify objects). As a result, the return value of (. target (and query)) is nil, and if you then use this in equals, you get a NullPointerException. The let form simply assigns this nil value to x, which we then forget about. But we’re interested in the side-effect – the variable target is now the result of target AND query.

Using this function we can then easily check whether the query fingerprint is contained within the target fingerprints by generating the target fingerprints on the fly and calling issubstruct?

1
2
3
4
5
(def fpmatches
     (map (fn [x]
        (issubstruct?
         queryfp (getfp x)))
      mols))

As before, we want to identify how many values are true

1
(count (filter (fn [x] (if x x)) fpmatches))

which gives us 83.

The fact that we get more than 7 hits using this method is OK since we’re using hashed fingerprints. So it seems that the CDK fingerprints (at least the extended fingerprinter) are working OK. On looking at the posted results however, some queries are not returning enough fingerprint matches. To properly answer this issue we’ll need to look at the original data.

But as a quick example of using Clojure for some reasonable task, this turned out pretty nicely. While I probably won’t be switching to Clojure for full time work, it is a fun platform to play with. (Also, there may be a possibility that R might be going to a more Lisp like form in the future, so this is good practice anyway!)

Cheminformatics in R – rcdk

Being an R aficionado, I do the bulk of my work in R and having grown up with Emacs I tend to dislike having to exit my environment to do “other” stuff. This was the motivation for integrating R and the CDK, so that I could access and manipulate chemical information from within my R session. This resulted in the rcdk package.

Since then there have been a lot of improvements in the CDK and so the latest version (2.9.2) of rcdk includes them and also provides access to much more of the CDK via R idioms. As the original J. Stat. Soft. paper is now pretty much deprecated, we have included a tutorial in the form of a vignette. The latest version of rcdk is now much smaller, since we have split out the actual CDK libraries into a separate package called rcdklibs. This allows us to release new versions of rcdk, without requiring a bulky download each time, since rcdklibs should change at a slower pace. I’d also like to thank Miguel Rojas Cherto for his contributions to this version of rcdk (as well as to rpubchem).

So what can you do with rcdk? Installation is pretty simple – just point your favorite interface to CRAN  (or  a mirror) and it should get it along with all the dependencies. After loading the library, you can read in any file format that the CDK supports or directly parse a SMILES

1
2
mols <- load.molecules("mymols.sdf")
mol.smiles <- parse.smiles("CC(=O)Cc1cc(Cl)ccc1")

which gives you a list of molecule objects. Note that these objects are actually pointers to Java objects and so you can’t serialize these via R’s save command. This is a pain and so I’m planning to implement some code generators that will create S4 classes directly from the Java class definitions.

Once you have a molecule object you can do a variety of things:

1
2
3
4
5
6
7
8
9
## view molecule depictions
view.molecule.2d(mols)

## evaluate fingerprints
fps <- get.fingerprints(mols, type="maccs")

## generate descriptors
dnames <- get.desc.names("topological")
descs <- eval.desc(mols, dnames)

One problem with the depiction code is that it does not work well on OS X. This is due to interactions between rJava and the R event handling loop. As a result, depictions show up, but then you can’t interact with the window. It does work fine on Linux and Windows. To easily handle fingerprints, I suggest the use of the fingerprint package. There are also methods to easily access atoms, bonds, molecule properties and so on.

Chemistry, Clouds, Collaboration (Part 2)

In my previous post I talked mainly about why there isn’t a large showing of chemistry in the cloud. It was based of Deepaks post and a FriendFeed thread, but really only addressed the first two words of the title. The issue of collaboration came up in the FriendFeed thread via some comments from Matthew Todd. He asked

I am also interested in why there are so few distributed chemistry collaborations – i.e. those involving the actual synthesis of chemical compounds and their evaluation. Does it come down to data sharing tools?

The term “distributed chemistry collaborations” arises, partly, from a recent paper. But one might say that the idea of distributed collaborations is already here. Chemists have been collaborating in variety of ways, though many of these collaborations are small and focused (say between two or three people).

I get the feeling that Matthew is talking about larger collaborations, something on the lines of the CombiUgi project or the ONS Challenge. I think there are a number of factors that might explain why we don’t see more such large, distributed chemistry collaborations.

First, there is the issue of IP and credit. How will it get apportioned? If each collaborator is providing a specific set of skills, I can see it being relatively simple. But then it also sounds like pretty much any current collaboration. What happens when multiple people are synthesizing different compounds? And you have multiple people doing assays? How is work dividied? How is credit received? And are large, loosely managed groups even efficient? Of course, one could compare the scenario to many large Open Source projects and their management issues.

Second, I think data sharing tools are a factor. How do collaborations (especially those without an informatics component) efficiently share information? Probably Excel :( – but there are a number of efforts such as CDD and ChemSpider which are making it much easier for chemists to share chemical information.

A third factor that is somewhat related to the previous point is that academic chemistry has somewhat ignored the informatics aspects of chemistry (both as infrastructure topic as well as a research area). I think this is partly related to the scale of academic chemistry. Certainly, many topics in chemical research do not require informatics capabilities (compared to say ab initio computational capabilities). But there are a number of areas, such as the type that Matthew notes, that can greatly benefit from an efficient informatics infrastructure. I certainly won’t say that it’s all there and ready to use – but I think it’s important cheminformatics plays a role. In this sense, one could say that there would be many more distributed collaborations, if the chemists knew that there was an infrastructure that could help their efforts. I will also note that it’s not just about infrastructure – while important, it’s also pretty straightforward IT (given some domain knowledge). I do think that there is a lot more to cheminformatics than just setting up databases, that can support bench chemistry efforts. Industry realizes this. Academia hasn’t so much (at least yet).

Which leads me to the fourth factor, which is social. Maybe the reason for the lack of such collaborations is there chemists just don’t have a good way of getting the word out they are available and/or interested. Certainly, things like FriendFeed are a venue for things like this to happen, but given that most academic chemists are conservative, it may take time for this to pick up speed.

Chemistry, Clouds, Collaboration (Part 1)

There’s been an interesting discussion sparked by Deepaks post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.

At a high level one can easily point out factors such as licensing costs for the tools to do chemistry in the cloud, lack of standards in data sets and formats and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involves sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested in seeing that Lilly seems to be making a move towards an Amazon based cloud infrastructure.

But when I read Deepaks post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?

While things like molecular dynamics are not going to run too well on a cloud set up, problems that are data parallel can make excellent use of such a set up. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers talking about the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning etc) would be another application.

But the problem I see facing these efforts is that they tend to be project specific. In contrast doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. On the other hand, each docking project is different, in terms of receptor (though there’s less variation) and ligand libraries. So on the chemistry side, the input is much larger and more variable.

Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?

Following on from this, one could ask, are chemistry datasets really that large? I’d say, no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3  million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not big enough that I can’t handle it.

So this leads to the question: what is big enough to make use of the cloud?

What about really large structure databases? Say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand a community oriented database like ChemSpider could take on this role – it already contains PubChem, so one could consider groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.

So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include

  • Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
  • Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
  • Many chemistry problems are not large in terms of data size, compared to other life science applications
  • Cheminformatics is a much smaller community than bioinformatics, though is applies mainly to non-corporate settings (where the reverse is likely true)

Though I haven’t explicitly talked about the tools – that certainly plays a factor. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one factor that will need to be addressed is the vendors coming on board and supporting cloud style setups.