So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for March, 2009

Manipulating SMARTS Queries

with 3 comments

Yesterday, Andreas Maunz asked a question on the openbabel-discuss list:

… a possibility to combine two distinct smarts patterns. Of course, there is the comma OR operator (s1,s2), but I was thinking of a more sophisticated combination, where equal parts of s1 and s2 are “mapped onto each other” in a way such that merely the exclusive parts of s1 and s2 are connected via OR to the common core structure, where possible?

Fundamentally, a solution requires us to be able to manipulate and transform the SMARTS query in various ways. Operations of these types have been been discussed by Roger Sayle in a MUG presentation and these are available in the OpenEye tools. As far as I know, there are no open source implementations of such SMARTS optimizations (or manipulations). Later on in Andres’ thread, Andrew Dalke suggested that one could attack this by performing an MCSS calculation on two graphs representing the two queries, along with some appropriate normalizations.

It turns out that this type of manipulation isn’t too difficult in the CDK, thanks to Dazhi Jiao’s work on the JJTree based SMARTS parser. Since the parser generates an abstract syntax tree, the SmartQueryVisitor is able to walk the tree and generate an IQueryAtomContainer. Basically this is analogous to a molecule (i.e., IAtomContainer) and hence we can apply pre-exisiting code for substructure searches. Thus, to find the largest common query fragment between two SMARTS queries we can do:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
import org.openscience.cdk.smiles.smarts.parser.SMARTSParser;
import org.openscience.cdk.smiles.smarts.parser.ASTStart;
import org.openscience.cdk.smiles.smarts.parser.ParseException;
import org.openscience.cdk.smiles.smarts.parser.visitor.SmartsQueryVisitor;
import org.openscience.cdk.isomorphism.matchers.IQueryAtomContainer;
import org.openscience.cdk.isomorphism.UniversalIsomorphismTester;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IBond;
import org.openscience.cdk.exception.CDKException;

import java.util.List;

public class SMARTSManipulator {

    public SMARTSManipulator() {
    }

    private IQueryAtomContainer getQueryMolecule(String smarts) throws ParseException {
        SMARTSParser parser = new SMARTSParser(new java.io.StringReader(smarts));
        ASTStart ast = parser.Start();
        SmartsQueryVisitor visitor = new SmartsQueryVisitor();
        return (IQueryAtomContainer) visitor.visit(ast, null);
    }

    // helper to convert the IQueryAtomContainer to an IAtomContainer, to make the UIT
    // happy. This is OK, since the atoms and bonds in the IAtomContainer are still the query
    // atoms and query bonds
    private IAtomContainer convertToAtomContainer(IQueryAtomContainer queryContainer) {
        IAtomContainer x = queryContainer.getBuilder().newAtomContainer();
        for (IAtom atom : queryContainer.atoms()) x.addAtom(atom);
        for (IBond bond : queryContainer.bonds()) x.addBond(bond);
        return x;
    }

    public void smartsMCSS() throws ParseException, CDKException {

        IQueryAtomContainer queryTarget = getQueryMolecule("c1ccccc1C(=O)C*");
        IQueryAtomContainer queryQuery = getQueryMolecule("*C([N,S])C=O");


        boolean isSubgraph = UniversalIsomorphismTester.isSubgraph(convertToAtomContainer(queryTarget), queryQuery);
        System.out.println("isSubgraph = " + isSubgraph);
        System.out.println("");

        List<IAtomContainer> mcss = UniversalIsomorphismTester.getOverlaps(
                convertToAtomContainer(queryTarget),
                convertToAtomContainer(queryQuery));

        System.out.println("mcss.size() = " + mcss.size() + "\n");

        for (IAtomContainer aMcs : mcss) {
            System.out.println("aMcs.getAtomCount() = " + aMcs.getAtomCount());
            for (IAtom matchingAtom : aMcs.atoms())
                System.out.println("matchingAtom = " + matchingAtom);
            System.out.println("");
        }
    }

    public static void main(String[] args) throws ParseException, CDKException {
        SMARTSManipulator m = new SMARTSManipulator();
        m.smartsMCSS();
    }
}

Note that we convert the IQueryAtomContainer to IAtomContainer, since the UniversalIsomorphism tester expects that we are performing queries against an actual (i.e., real) molecule.

As expected the smaller query is not a subgraph of the larger query. But when we evaluate the MCSS, we do end up with a query fragment consisting of *CC=O.

Now this solution is just a quick hack to see if it could be done – it will fail on more general queries (such as C[#6]N and CCN) since we don’t perform any normalization. Furthermore, there are more sophisticated optimizations that could be done on the AST directly but that’s for another night

Written by Rajarshi Guha

March 7th, 2009 at 7:10 am

Easy Parallelization With Clojure

with 6 comments

As I noted in my previous post, one of the nice features of Clojure is its support for concurrent programming. Now, it provides some fancy features that allow one to write complex parallel programs. I’m certainly no expert on that topic. However, one thing that I do everyday is perform operations on elements of a list. Traditionally, this is a serial operation. But what’d be nice is to have my compiler (or environment) perform this operation in parallel over the elements of the list. Clojure provides a very simple way to do this – pmap.

The map form simply applies a function to the elements of a list (or corresponding elements of multiple lists) in order, returning a list. By prepending the “p”, this is done in parallel making use of as many cores as are present on your system (but see below). Given the ease of this operation, lets see what we can do with pmap and the CDK.

Based on the previous post, lets calculate fingerprints in a serial fashion followed by the parallel version and see how long it takes. For completeness, I’ll repeat the code from the previous post. First import our packages and set up some basic objects and functions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint MACCSFingerprinter))
(import '(org.openscience.cdk.fingerprint Fingerprinter))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))
;; so we can read lines from a file
(use 'clojure.contrib.duck-streams)

(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))
(def fprinter (new ExtendedFingerprinter))

(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

Next, we load the 4,688 molecules from the data file I described previously. In contrast to before, this code is slightly shorter (thanks to Nik) but also uses the doall form. This forces evaluation of the list, so that the molecules are all loaded into memory. In the timing code we again use it, since if we don’t, we just get the time for the list creation step (which is “instantaneous” due to lazy evaluation), rather than the actual list evaluation.

1
2
(def mols (doall (map #(getmol (. % trim))
              (read-lines "junk.smi"))))

Now, we can evaluate the fingerprints and time the operation. Initially I performed these calculations on my Macbook Pro with 2GB RAM and a dual core CPU.

1
(time (def fpserial (doall (map getfp mols))))

This run took 38.8 s (averaged over three runs). Next we consider the parallel version

1
(time (def fpparallel (doall (pmap getfp mols))))

This version has an average run time of 23.4 s – a 1.6x speedup. Now, it’s not exactly a two-fold speedup. Part of the reason is that there is some overhead for the threads. Also, even in the serial version, the garbage collector takes up some of the second core and in the parallel version, this will contend with the actual calculation.

Just to be sure that the calculation works OK, lets compare (via BitSet.equals) the fingerprints obtained using the two versions. We expect the result of the code below to be 0

1
2
3
4
5
(count (filter #(if (not %) %)
           (map
        (fn [x,y]
            (. x (equals y)))
        fpserial fpparallel)))

and that’s exactly what we get.

What about using more cores? I have access to some dual-CPU machines with 8GB of RAM, each CPU having four cores. Repeating the above calculations, the serial version takes 28.1 s and the parallel version takes 8.6 s, a 3.2x speedup. One thing I noted was that this really only uses the cores on one CPU, rather than all eight cores.

One thing that will require more investigation is to what extent we can make use of the CDK in parallel environments, since the library was not designed with thread-safety in mind. For example, parsing the SMILES strings using pmap (after reading in all the lines from the file) gives me an ExecutionException error.

In any case, it’s very cool that I can use multiple cores just converting map to pmap.

Written by Rajarshi Guha

March 6th, 2009 at 5:15 pm

Posted in cheminformatics,software

Tagged with , , ,

Testing CDK Fingerprints with Clojure

with 10 comments

Sometime back I was bored and thought that learning Lisp would be a good past time (maybe I should get a life?). I started of with SBCL and then discovered Clojure, a Lisp dialect that compiles to the JVM. The nice thing about this is that it allows one to write Lisp but also interact quite easily with Java libraries. As a result, I can get down to writing code without having to reimplement a bunch of stuff (such as cheminformatics methods for example).

There’s been quite a buzz about this implementation in the Lisp community. It so happens there have recently been a blog post and FF threads about the possible use of Clojure in bioinformatics. As Bosco pointed out, one reason for Clojures’ attractiveness is that  it has good support for concurrent programming – but I haven’t gotten that far yet.

So over the last few months I’ve been playing on and off with Clojure and the CDK – mostly small stuff to get the feel for Lisp. Today I came across a blog post indicating that the CDK fingerprints weren’t doing very well at identifying substructures.

C(C)(C)OCC(C)=C

C(C)(C)OCC(C)=C

As noted in the post, the EState and MACCS don’t do very well which is not too surprising. However the post seemed to imply that the hashed fingerprints weren’t working well.

This seemed like an opportunity to practice my Lisp skills. I downloaded the small molecule dataset from DrugBank and converted it to SMILES with OpenBabel. I did some quick filtering to remove a few molecules, resulting in 4,688 SMILES. For this test I just considered one of the query structures from the original post, shown alongside.

So the first step is to import some useful Java classes and Clojure packages

1
2
3
4
5
6
7
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))

;; so we can easily read lines from a file
(use 'clojure.contrib.duck-streams)

We then setup some of the objects that we’ll use such as the SMILES parser and fingerprinter

1
2
3
4
(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))

;; Use the default settings for the fingerprinter
(def fprinter (new ExtendedFingerprinter))

Finally, we create some simple helper methods

1
2
(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

OK, so now we’re ready to play. The first step is to create a fingerprint for the query molecule

1
2
(def querySmiles "C(C)(C)OCC(C)=C")
(def queryfp (getfp (getmol querySmiles)))

Next, we load the molecules by reading each line from the SMILES file and parsing the string into a molecule.

1
2
3
4
(def mols
     (map (fn [x]
        (getmol (. x trim)))
      (read-lines "junk.smi")))

The nice thing about Clojure is that lists are lazy – so if you execute this statement it returns immediately. The molecules will only be parsed when you start accessing the list and even then, it only parses those molecules that are actually accessed. So in other words the following code will give you the first molecule in the list, but all the rest will still remain “unparsed”

1
(first mols)

So given our molecule list, we first want to check whether the query molecule is a substructure using subgraph isomorphism. We can do this using the SMARTSQueryTool as follows

1
2
3
4
5
(def sqt (new SMARTSQueryTool querySmiles))
(def matches
     (map (fn [x]
        (. sqt (matches x)))
      mols))

The result is that matches is a (lazy) list of boolean values. We’d like to see how many of these values are true

1
(count (filter (fn [x] (if x x)) matches))

This gives us 7 (which also matches the result of obgrep). So clearly our query exists as a substructure in this target collection. How do the fingerprints perform?

First we make a helper function to check whether the query fingerprint is contained within the target fingerprint. This is performed by and’ing the two fingerprints and checking whether the result equals the query fingerprint.

1
2
3
(defn issubstruct? [query,target]
  (let [x (. target (and query))]
    (. target (equals query))))

The let form in this function is key – the and method of Java’s BitSet class modifies the object rather than returning the resultant BitSet. This is contrary to the functional paradigm (since a truly functional language does not modify objects). As a result, the return value of (. target (and query)) is nil, and if you then use this in equals, you get a NullPointerException. The let form simply assigns this nil value to x, which we then forget about. But we’re interested in the side-effect – the variable target is now the result of target AND query.

Using this function we can then easily check whether the query fingerprint is contained within the target fingerprints by generating the target fingerprints on the fly and calling issubstruct?

1
2
3
4
5
(def fpmatches
     (map (fn [x]
        (issubstruct?
         queryfp (getfp x)))
      mols))

As before, we want to identify how many values are true

1
(count (filter (fn [x] (if x x)) fpmatches))

which gives us 83.

The fact that we get more than 7 hits using this method is OK since we’re using hashed fingerprints. So it seems that the CDK fingerprints (at least the extended fingerprinter) are working OK. On looking at the posted results however, some queries are not returning enough fingerprint matches. To properly answer this issue we’ll need to look at the original data.

But as a quick example of using Clojure for some reasonable task, this turned out pretty nicely. While I probably won’t be switching to Clojure for full time work, it is a fun platform to play with. (Also, there may be a possibility that R might be going to a more Lisp like form in the future, so this is good practice anyway!)

Written by Rajarshi Guha

March 6th, 2009 at 1:32 am