So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘Uncategorized’ Category

The Move is Complete

with 5 comments

The move to Rockville is complete and I’ve started work at the NCGC. This week has been a bit hectic what with setting up house and getting up to speed at work. But I’m really excited with the stuff that’s going on here – lots of interesting projects in various areas of chemical biology and genomics. Also, being surrounded by some really smart people is invigorating (and leads to some nice lunch discussions). One aspect of the center that I really like is the commitment to Open Source – while there isn’t a whole lot at the moment, there’s some cool stuff coming down the pipeline. And a 30″ Apple Cinema display is sweet.

For now, back to reading up on HTS, RNAi and browsing for sofas.

Written by Rajarshi Guha

May 23rd, 2009 at 4:07 am

Posted in Uncategorized

Hadoop, Chunks and Multi-line Records

with 7 comments

Chunking an input file

Chunking an input file

In a previous post I described how one requires a custom RecordReader class to deal with multi-line records  (such as SD files) in a Hadoop program. While it worked fine on a small input file (less than 5MB) I had not addressed the issue of “chunking” and that caused it to fail when dealing with larger files (the code in that post is updated now).

When a Hadoop program is run on an input file, the framework will send chunks of the input file to individual RecordReader instances. Note that it doesn’t actually read the entire file and send around portions of it – that would not scale to petabyte files! Rather, it determines the size of the file and ends start and end offsets into the original file, to the RecordReaders. They then seek to the appropriate position in the original file and then do their work.

The problem with this is that when a RecordReader receives a chunk (defined in terms of start and offsets), it can start in the middle of a record and end in the middle of another record. This shown schematically in the figure, where the input file with 5 multi-line, variable length records is divided into 5 chunks. As you can see, in the general case, chunks don’t start or end on record boundaries.

My initial code, when faced with chunks failed badly since rather than recognizing chunk boundaries it simply read each record in the whole file. Alternatively (and naively) if one simply reads up to a chunk boundary, the last and first records read from that chunk will generally be invalid.

The correct (and simple) strategy for an arbitrary chunk, is to make sure that the start position is not 0. If so, we read the bytes from the start position until we reach the first end of record marker. In general, the record we just read will be incomplete, so we discard it. We then carry on reading complete records as usual. But if, after reading a record, we note that the current file position is beyond the end position of the current chunk, we note that the chunk is done with and just return this last record. Thus, according to the figure, when processing he second chunk from the top, we read in bytes 101 to 120 and discard that data. We then start reading the initial portion of Record 3 until the end of the record, at position 250 – even though we’ve gone beyond the chunk boundary at position 200. However we now flag that we’re done with the chunk and carry on.

When another RecordReader class gets the next chunk starting at position 200, it will be dumped into the middle of Record 3. But, according to our strategy, we simply read till the end of record marker at position 250 and discard the data. This is OK, since the RecordReader instance that handled the previous chunk already read the whole of Record 3.

The two edge cases here are when the chunk starts at position 0 (beginning of the input file) and the chunk ends at the end of file. In the former case, we don’t discard anything, but simply process the entire chunk plus a bit beyond it to get the entire last record for this chunk. For the latter case, we simply check whether we’re at the end of the file and flag it to the nextKeyValue() method.

The implementation of this strategy is shown in the SDFRecordReader class listing.

In hindsight this is pretty obvious, but I was bashing myself for a while and hopefully this explanation saves others some effort.

Written by Rajarshi Guha

May 6th, 2009 at 5:00 am

Posted in software,Uncategorized

Tagged with ,

Change – Yes, I Did

with 12 comments

I’ve been at IU for 3 years now, the last two as a visiting faculty member. However the time has come for a change. I have accepted a position at the NIH Chemical Genomics Center working in the informatics group and will start there in mid-May.

I’m pretty excited about this move, as the group is doing some pretty cool stuff with cheminformatics and high-throughput assays. The group releases their code into the public domain and you can see some of their stuff here. From a scientific point of view, this is an excellent place to do cheminformatics (as well as informatics and computation in general) because the biologists and chemists are right next door. This means access to fresh, raw data as well as the possibility of working with wet chemistry and biology to test out in silico analyses and predictions. (They also have a neat robotics set up, also next door).

As part of the informatics group, I will have fingers in many pies – modeling and computation, developing software, data management and project support. I’m particularly excited about the forthcoming RNAi screening infrastructure that is being set up at the Center.

What happens to the web services?

All services are currently running, but will not be available at the rguha.ath.cx URL’s after mid May. The services have been transferred to David Wild and I will try and redirect URL’s before I leave – if you find that a service you are using is not working, get in touch with David. However, the source code for the SOAP services are freely available on SourceForge, so you can always deploy them into your local Tomcat container. However, I do not plan on supporting these services any longer.

As I have described previously, I am shifting to REST based services. Thus all the CDK based cheminformatics services are converted to REST forms and are currently hosted of rguha.ath.cx, but after I move, this URL will no longer be available. However, running these services is very easy and is described here. If anybody is interested in hosting these REST services, please get in touch with me. In terms of future support, I plan to work on these REST services in my spare time. Note that these services replace a number of my mod_python based REST services (see the GitHub repository which I will updated shortly to remove deprecated services). In general, I plan to maintain the Python REST services. In particular, some of these  services are still front-ends to SOAP services. Where possible, I will reomove the dependency on the SOAP service. The prediction service is something I plan to work on as it is a very light weight way to deploy predictive models built using R (and less hassle than going via the SOAP based R infrastructure). Unfortunately, until I work out some server solution, the predictive model service will likely not be publicly accessible (as a service, that is).

Databases

Over the last three years I’ve been maintaining a partial mirror of PubChem and the Pub3D database. After my move I will no longer be handling these, so questions should be sent to David. A number of services that I provide make use of these databases. Examples include, 3D structure search, InChI/InChI key  resolution, synonyms and so on. These services will continue to run, though the URL’s will likely change. – as before, if you see something is down, David should be able to point you in the right direction. I will likely continue work on the 3D database (especially the similarity search aspects, which is currently flawed), but at this point I don’t know when or how new results will be posted.

I should point out that future work on the projects described here are spare-time projects and not part of my official duties at NCGC. So progress on these is not going to be particularly fast.

Finally, my website has moved to http://rguha.net and the old one should automatically redirect you to the new one.

Written by Rajarshi Guha

March 13th, 2009 at 1:10 pm

Posted in Uncategorized

Circular Fingerprints with the CDK and Clojure

with 2 comments

One of the things I’m liking about Clojure is that it can be used as a quick prototyping language, a lot like Python. This is quite handy when playing with the CDK, as I can avoid a lot of the verbosity of Java code. As an example, I put together a very simplistic circular fingerprint (exemplified by molprint2d and ECFP’s). In this case, the fingerprint is simply composed of atom type names and is quite similar to an implementation by Noel.

First, some imports and simple helper functions that allow me to work with native Clojure sequences rather than having to go via Java methods

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))

(use 'clojure.contrib.duck-streams)
(use 'clojure.contrib.str-utils)

;; surely there must be a core function for this!
(defn has
  "true if val exists in coll, nil otherwise. Uses ="
  [coll val]
  (some #(= % val) coll))

;; http://markmail.org/message/56r3eflx4a6tasoe
(defn flatten
  "Flatten a list which may contain lists"
  [x]
  (let [s? #(instance? clojure.lang.Sequential %)]
    (filter
     (complement s?)
     (tree-seq s? seq x))))

;; Maybe these could be converted to macro's?
(defn get-atom-list
  "Get the atoms of a molecule as a Clojure sequence"
  [mol]
  (map #(identity %) (. mol atoms)))

(defn get-connected-atoms-list
  "Get the atoms connected to a specified atom of a molecule
   as a Clojure sequence"

  [mol atom]
  (map #(identity %) (. mol (getConnectedAtomsList atom))))

(defn getmol
  "Parse a SMILES string and return the molecule object. This is
   quite inefficient since each time it is called it creates a new
   parser object. Handy for quick testing"

  [smiles]
  (. (new SmilesParser (DefaultChemObjectBuilder/getInstance))  
     (parseSmiles smiles)))

These go into my user.clj, so are available when I start Clojure. With these in hand, we need a function which will get us the atoms in successive shells around a specified atom. In other words, atoms in the first shell are those connected directly to the specified atom. Atoms in the second shell lie two bonds away and so on. This is achieved by the following function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
(defn get-shells
  "Get the atoms in successive shells surrounding a specified atom.
   The first shell equals the immediate neighbors, the second shell
   the neighbors at bond length 2 and so on. The return value is a list
   in which each element is a list of atoms. The first element of the
   top level list is the n'th shell. Note that the return value does not
   contain the starting atom (i.e., zeroth shell)"

  [mol start depth]
  (loop [shells (list (list start))
        visited (list)]
    (if (= (count shells) (+ 1 depth))
      (drop-last shells)
      (recur (conj shells
           (filter #(not (has visited %))
               (flatten (map #(get-connected-atoms-list mol %) (first shells)))))
         (set (flatten (list visited shells)))))))

Here, depth indicates how many shells we want to go out to. The return value is a list of lists, each sublist representing the atoms in a shell. The first element corresponds to the n’th shell. Note that the zeroth shell (i.e., the start atom) is not contained in the return value.

Before applying this function to all the atoms, we need to consider how we’re going to represent each shell. For example ECFP’s consider various atom properties (degree, H-bonding ability etc). But a very simple procedure is to simply replace each atom in a shell with its atom type name, sort the names and join them into a single string. The end result would be, for a single atom, a list of strings corresponding to the shells around that atom.

First, we have a simple function to convert a list of atoms into a list of CDK atom type names

1
2
3
(defn atype [atoms]
  (map #(. % getAtomTypeName)
       atoms))

Using this, we can then use it to convert the lists of “atom shells” around a given atom to a list of atom type strings

1
2
3
4
5
(defn make-atom-fp
  [mol atom depth]
  (map #(str-join "-" (sort %))
       (map atype
           (get-shells mol atom depth))))

The return value is still a list of strings representing each shell around the specified atom. In practice this list of strings would likely be reduced to a single string or number. The last thing to do is to apply the function to all the atoms in a molecule (using a depth of 2 for example)

1
(map #(make-atom-fp mol % 2) (get-atom-list mol))

An example of using these functions is:

1
2
3
4
user=> (def mol (getmol "C(O)(F)CN"))
#'user/mol
user=> (map #(make-atom-fp mol % 2) (get-atom-list mol))
(("N.sp3" "C.sp3-F-O.sp3") ("C.sp3-F" "C.sp3") ("C.sp3-O.sp3" "C.sp3") ("F-O.sp3" "C.sp3-N.sp3") ("C.sp3" "C.sp3"))

Alternatively, we could reduce this to a list of strings, one for each atom, by simply joining the strings for each shell of a given atom

1
2
user=> (map #(str-join "-" %) (map #(make-atom-fp mol % 2) (get-atom-list mol)))
("N.sp3-C.sp3-F-O.sp3" "C.sp3-F-C.sp3" "C.sp3-O.sp3-C.sp3" "F-O.sp3-C.sp3-N.sp3" "C.sp3-C.sp3")

This is a very simple example, and as I’m still wrapping my head around Lisp, it took some time to write. But once you get the hang of loop and recur, things start flowing!

Written by Rajarshi Guha

March 8th, 2009 at 8:54 pm

Testing CDK Fingerprints with Clojure

with 10 comments

Sometime back I was bored and thought that learning Lisp would be a good past time (maybe I should get a life?). I started of with SBCL and then discovered Clojure, a Lisp dialect that compiles to the JVM. The nice thing about this is that it allows one to write Lisp but also interact quite easily with Java libraries. As a result, I can get down to writing code without having to reimplement a bunch of stuff (such as cheminformatics methods for example).

There’s been quite a buzz about this implementation in the Lisp community. It so happens there have recently been a blog post and FF threads about the possible use of Clojure in bioinformatics. As Bosco pointed out, one reason for Clojures’ attractiveness is that  it has good support for concurrent programming – but I haven’t gotten that far yet.

So over the last few months I’ve been playing on and off with Clojure and the CDK – mostly small stuff to get the feel for Lisp. Today I came across a blog post indicating that the CDK fingerprints weren’t doing very well at identifying substructures.

C(C)(C)OCC(C)=C

C(C)(C)OCC(C)=C

As noted in the post, the EState and MACCS don’t do very well which is not too surprising. However the post seemed to imply that the hashed fingerprints weren’t working well.

This seemed like an opportunity to practice my Lisp skills. I downloaded the small molecule dataset from DrugBank and converted it to SMILES with OpenBabel. I did some quick filtering to remove a few molecules, resulting in 4,688 SMILES. For this test I just considered one of the query structures from the original post, shown alongside.

So the first step is to import some useful Java classes and Clojure packages

1
2
3
4
5
6
7
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))

;; so we can easily read lines from a file
(use 'clojure.contrib.duck-streams)

We then setup some of the objects that we’ll use such as the SMILES parser and fingerprinter

1
2
3
4
(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))

;; Use the default settings for the fingerprinter
(def fprinter (new ExtendedFingerprinter))

Finally, we create some simple helper methods

1
2
(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

OK, so now we’re ready to play. The first step is to create a fingerprint for the query molecule

1
2
(def querySmiles "C(C)(C)OCC(C)=C")
(def queryfp (getfp (getmol querySmiles)))

Next, we load the molecules by reading each line from the SMILES file and parsing the string into a molecule.

1
2
3
4
(def mols
     (map (fn [x]
        (getmol (. x trim)))
      (read-lines "junk.smi")))

The nice thing about Clojure is that lists are lazy – so if you execute this statement it returns immediately. The molecules will only be parsed when you start accessing the list and even then, it only parses those molecules that are actually accessed. So in other words the following code will give you the first molecule in the list, but all the rest will still remain “unparsed”

1
(first mols)

So given our molecule list, we first want to check whether the query molecule is a substructure using subgraph isomorphism. We can do this using the SMARTSQueryTool as follows

1
2
3
4
5
(def sqt (new SMARTSQueryTool querySmiles))
(def matches
     (map (fn [x]
        (. sqt (matches x)))
      mols))

The result is that matches is a (lazy) list of boolean values. We’d like to see how many of these values are true

1
(count (filter (fn [x] (if x x)) matches))

This gives us 7 (which also matches the result of obgrep). So clearly our query exists as a substructure in this target collection. How do the fingerprints perform?

First we make a helper function to check whether the query fingerprint is contained within the target fingerprint. This is performed by and’ing the two fingerprints and checking whether the result equals the query fingerprint.

1
2
3
(defn issubstruct? [query,target]
  (let [x (. target (and query))]
    (. target (equals query))))

The let form in this function is key – the and method of Java’s BitSet class modifies the object rather than returning the resultant BitSet. This is contrary to the functional paradigm (since a truly functional language does not modify objects). As a result, the return value of (. target (and query)) is nil, and if you then use this in equals, you get a NullPointerException. The let form simply assigns this nil value to x, which we then forget about. But we’re interested in the side-effect – the variable target is now the result of target AND query.

Using this function we can then easily check whether the query fingerprint is contained within the target fingerprints by generating the target fingerprints on the fly and calling issubstruct?

1
2
3
4
5
(def fpmatches
     (map (fn [x]
        (issubstruct?
         queryfp (getfp x)))
      mols))

As before, we want to identify how many values are true

1
(count (filter (fn [x] (if x x)) fpmatches))

which gives us 83.

The fact that we get more than 7 hits using this method is OK since we’re using hashed fingerprints. So it seems that the CDK fingerprints (at least the extended fingerprinter) are working OK. On looking at the posted results however, some queries are not returning enough fingerprint matches. To properly answer this issue we’ll need to look at the original data.

But as a quick example of using Clojure for some reasonable task, this turned out pretty nicely. While I probably won’t be switching to Clojure for full time work, it is a fun platform to play with. (Also, there may be a possibility that R might be going to a more Lisp like form in the future, so this is good practice anyway!)

Written by Rajarshi Guha

March 6th, 2009 at 1:32 am