Testing CDK Fingerprints with Clojure

Sometime back I was bored and thought that learning Lisp would be a good past time (maybe I should get a life?). I started of with SBCL and then discovered Clojure, a Lisp dialect that compiles to the JVM. The nice thing about this is that it allows one to write Lisp but also interact quite easily with Java libraries. As a result, I can get down to writing code without having to reimplement a bunch of stuff (such as cheminformatics methods for example).

There’s been quite a buzz about this implementation in the Lisp community. It so happens there have recently been a blog post and FF threads about the possible use of Clojure in bioinformatics. As Bosco pointed out, one reason for Clojures’ attractiveness is thatĀ  it has good support for concurrent programming – but I haven’t gotten that far yet.

So over the last few months I’ve been playing on and off with Clojure and the CDK – mostly small stuff to get the feel for Lisp. Today I came across a blog post indicating that the CDK fingerprints weren’t doing very well at identifying substructures.



As noted in the post, the EState and MACCS don’t do very well which is not too surprising. However the post seemed to imply that the hashed fingerprints weren’t working well.

This seemed like an opportunity to practice my Lisp skills. I downloaded the small molecule dataset from DrugBank and converted it to SMILES with OpenBabel. I did some quick filtering to remove a few molecules, resulting in 4,688 SMILES. For this test I just considered one of the query structures from the original post, shown alongside.

So the first step is to import some useful Java classes and Clojure packages

(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))

;; so we can easily read lines from a file
(use 'clojure.contrib.duck-streams)

We then setup some of the objects that we’ll use such as the SMILES parser and fingerprinter

(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))

;; Use the default settings for the fingerprinter
(def fprinter (new ExtendedFingerprinter))

Finally, we create some simple helper methods

(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

OK, so now we’re ready to play. The first step is to create a fingerprint for the query molecule

(def querySmiles "C(C)(C)OCC(C)=C")
(def queryfp (getfp (getmol querySmiles)))

Next, we load the molecules by reading each line from the SMILES file and parsing the string into a molecule.

(def mols
     (map (fn [x]
        (getmol (. x trim)))
      (read-lines "junk.smi")))

The nice thing about Clojure is that lists are lazy – so if you execute this statement it returns immediately. The molecules will only be parsed when you start accessing the list and even then, it only parses those molecules that are actually accessed. So in other words the following code will give you the first molecule in the list, but all the rest will still remain “unparsed”

(first mols)

So given our molecule list, we first want to check whether the query molecule is a substructure using subgraph isomorphism. We can do this using the SMARTSQueryTool as follows

(def sqt (new SMARTSQueryTool querySmiles))
(def matches
     (map (fn [x]
        (. sqt (matches x)))

The result is that matches is a (lazy) list of boolean values. We’d like to see how many of these values are true

(count (filter (fn [x] (if x x)) matches))

This gives us 7 (which also matches the result of obgrep). So clearly our query exists as a substructure in this target collection. How do the fingerprints perform?

First we make a helper function to check whether the query fingerprint is contained within the target fingerprint. This is performed by and’ing the two fingerprints and checking whether the result equals the query fingerprint.

(defn issubstruct? [query,target]
  (let [x (. target (and query))]
    (. target (equals query))))

The let form in this function is key – the and method of Java’s BitSet class modifies the object rather than returning the resultant BitSet. This is contrary to the functional paradigm (since a truly functional language does not modify objects). As a result, the return value of (. target (and query)) is nil, and if you then use this in equals, you get a NullPointerException. The let form simply assigns this nil value to x, which we then forget about. But we’re interested in the side-effect – the variable target is now the result of target AND query.

Using this function we can then easily check whether the query fingerprint is contained within the target fingerprints by generating the target fingerprints on the fly and calling issubstruct?

(def fpmatches
     (map (fn [x]
         queryfp (getfp x)))

As before, we want to identify how many values are true

(count (filter (fn [x] (if x x)) fpmatches))

which gives us 83.

The fact that we get more than 7 hits using this method is OK since we’re using hashed fingerprints. So it seems that the CDK fingerprints (at least the extended fingerprinter) are working OK. On looking at the posted results however, some queries are not returning enough fingerprint matches. To properly answer this issue we’ll need to look at the original data.

But as a quick example of using Clojure for some reasonable task, this turned out pretty nicely. While I probably won’t be switching to Clojure for full time work, it is a fun platform to play with. (Also, there may be a possibility that R might be going to a more Lisp like form in the future, so this is good practice anyway!)

10 thoughts on “Testing CDK Fingerprints with Clojure

  1. Great write up! Thanx!

  2. BTW, it the above automatically run in parallel? Does it automatically detect that it can do that for the loops?

  3. Nik says:

    Thank you for the excellent post, Clojure for bio/chemoinformatics would be fantastic!
    Spent a few minutes wrapping the code into functions … hope this is useful.

    Perhaps using pfilter and pmap from the parallel library in clojure would help to make it use all cores automatically.

    ;(use ‘cdk)
    ;(def mols (read-smi “junk.smi”))
    ;(count (get-match mols “C(C)(C)OCC(C)=C” match-sm?))
    ;(count (get-match mols “C(C)(C)OCC(C)=C” match-fp?))

    (ns cdk
    (:use (clojure.contrib duck-streams))
    (org.openscience.cdk Molecule AtomContainer)
    (org.openscience.cdk.smiles.smarts SMARTSQueryTool)
    (org.openscience.cdk.smiles SmilesParser)
    (org.openscience.cdk DefaultChemObjectBuilder)
    (org.openscience.cdk.fingerprint ExtendedFingerprinter) ))

    (defn getmol
    “Parses a smiles string”
    (.parseSmiles (SmilesParser.
    (DefaultChemObjectBuilder/getInstance)) smiles))

    (defn getfp
    “Returns extended CDK fingerprint”
    (. (ExtendedFingerprinter.) (getFingerprint mol)))

    (defn read-smi
    “Reads a file with SMILES”
    (map #(getmol (. % trim)) (read-lines file)))

    (defn match-sm?
    “Checks for a SMARTS match”
    [target query]
    (. (SMARTSQueryTool. query) (matches target)))

    (defn match-fp?
    “Checks for a fingerprint match”
    [target query]
    (. (getfp target) (equals (getfp (getmol query)))))

    (defn get-match
    “Returns SMARTS substructure matches”
    [mols query match-fn?]
    (filter #(match-fn? % query) mols))

  4. Egon, apparently the map statement can be converted to pmap – which will automatically use as many cores as you have. I’ve done some experiments which show that pmap is slower than map – I think due to thread overheads. At the same time, stuff like SMILES parsing fails in a parallel mode since the class is not thread safe. Another blog post will talk about this

  5. Nik, thanks a lot for the code. Being a Lisp newbie, I appreciate looking at good examples :)

    Can you explain what this means?

    (map #(getmol (. % trim)) (read-lines file)))

    specifically, what does the ‘#’ do? and the ‘%’?

  6. Nik says:

    Dear Rajarshi,
    Enjoying learning Clojure too! According to http://clojure.org/reader, towards the end of the page), the # in
    (map #(getmol (. % trim) (read-lines file)))
    the # is a Clojure shorthand (reader-macro) for anonymous function and % is its argument; so this would be equivalent to:
    (map (fn [x] (getmol (. x trim))) (read-lines file)))
    but shorter.
    He-he, just noticed it can be shortened even further:
    (map #(getmol (.trim %) (read-lines file))) …
    Perhaps it could be of interest to wrap more of the CDK functionality into clojure and maybe make it available in clojure-contrib…

  7. Patrick Sullivan says:

    I’m not Nik but I can answer the questions

    #() is the reader macro for making a lambda function, and % is to take a single argument passed into it and use it in that spot.

    So it’s like doing (fn [x] (getmol (. x trim)))

  8. […] I noted in my previous post, one of the nice features of Clojure is its support for concurrent programming. Now, it provides […]

  9. Nik and Patrick, thanks for the explanation.

    Nik, yes, wrapping some bits of the CDK in simple functions might be useful – but the way that Clojure works with Java libraries is pretty nice (after having worked with Java in R which is painful and *requires* wrappers!). WRT, clojure-contrib, I’m not sure that this is general interest for inclusion there (?)

  10. […] sorry for the incorrect post. I’ve run the test again, using the SMILESĀ provided by Guha as input. The groovy script is also attached […]

Leave a Reply

Your email address will not be published. Required fields are marked *