So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘fingerprint’ tag

Benchmarking the CDK Hybridization Fingerprinter

without comments

This morning Egon reported that he had implemented a new fingerprinter for the CDK, which only considered hybridization rather than looking at aromaticity. As a result this approach does not require aromaticity perception. I took a quick look to see how it performs in a virtual screening benchmark. Firstly, it’s faster than the other CDK hashed fingerprints – 15,030 fingerprint calculations took ~ 60s with the hybridization only fingerprint. In contrast the extended fingerprint took 80s for the same set of molecules. To test the utility of the fingerprint in a virtual screening scenario I evaluated enrichment curves (see here for a comprehensive comparison of CDK fingerprints) using the AID 692 MUV benchmark dataset. The plots below show the enrichment curves for the first 5% of the database and the entire database. The red curve corresponds to random selections. (In this experiment the database consists of 15,000 decoys and 30 actives). The enrichment factor for the standard, extended and hybiridization only fingerprints were 0.94, 1.06 and 1.38 respectively.

Overall, the hybridization only fingerprint performs comparably to the extended fingerprint and better than the standard one. But at a small percentage of the database screened, it appears that this fingerprint outperforms both. Of course, this is only one dataset, and more MUV datasets should be analyzed to get a more comprehensive view.

   

Written by Rajarshi Guha

July 17th, 2010 at 1:59 am

Update to the fingerprint Package

with one comment

I’ve just uploaded a new version of the fingerprint package (v3.3) to CRAN that implements some ideas described in Nisius and Bajorath. First, the balance method generates “balanced code” fingerprints, which given an input fingerprint of N bits, returns a new fingerprint of 2N bits, such that the bit density is exactly 50%. Second, bit.importance is a method to evaluate the importance of each bit in a fingerprint, in terms of the Kullback-Liebler divergence between a collection of actives and background molecules. In other words, the method ranks the bits in terms of their ability to discriminate between the actives and the background molecules.

Written by Rajarshi Guha

June 3rd, 2010 at 1:07 am

A Quick Look at the GSK Malaria Dataset

with 5 comments

A few days ago, GSK released an approximately 13,000 member compound library (using the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL and a paper is available, that describes the screening project and results. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.

There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.

I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-member rings. Another region of difference occurs around bit position 300 and then around positions 350-375.

The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.

Written by Rajarshi Guha

May 23rd, 2010 at 1:02 pm

Slides from a Guest Lecture at Drexel University

without comments

On Thursay I joined Antony Williams as a guest lecturer in Jean Claude-Bradleysclass on chemical information retrieval at Drexel University. Using a combination of WebEx and Skype, we were able to give our presentations – seamlessly joining three different locations. Technology is great! Tony gave an excellent talk on citizen science and ChemSpider and I spoke about similarity and searching. Jean Claude has also put up an audio version.

Written by Rajarshi Guha

December 5th, 2009 at 11:08 pm

Circular Fingerprints with the CDK and Clojure

with 2 comments

One of the things I’m liking about Clojure is that it can be used as a quick prototyping language, a lot like Python. This is quite handy when playing with the CDK, as I can avoid a lot of the verbosity of Java code. As an example, I put together a very simplistic circular fingerprint (exemplified by molprint2d and ECFP’s). In this case, the fingerprint is simply composed of atom type names and is quite similar to an implementation by Noel.

First, some imports and simple helper functions that allow me to work with native Clojure sequences rather than having to go via Java methods

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))

(use 'clojure.contrib.duck-streams)
(use 'clojure.contrib.str-utils)

;; surely there must be a core function for this!
(defn has
  "true if val exists in coll, nil otherwise. Uses ="
  [coll val]
  (some #(= % val) coll))

;; http://markmail.org/message/56r3eflx4a6tasoe
(defn flatten
  "Flatten a list which may contain lists"
  [x]
  (let [s? #(instance? clojure.lang.Sequential %)]
    (filter
     (complement s?)
     (tree-seq s? seq x))))

;; Maybe these could be converted to macro's?
(defn get-atom-list
  "Get the atoms of a molecule as a Clojure sequence"
  [mol]
  (map #(identity %) (. mol atoms)))

(defn get-connected-atoms-list
  "Get the atoms connected to a specified atom of a molecule
   as a Clojure sequence"

  [mol atom]
  (map #(identity %) (. mol (getConnectedAtomsList atom))))

(defn getmol
  "Parse a SMILES string and return the molecule object. This is
   quite inefficient since each time it is called it creates a new
   parser object. Handy for quick testing"

  [smiles]
  (. (new SmilesParser (DefaultChemObjectBuilder/getInstance))  
     (parseSmiles smiles)))

These go into my user.clj, so are available when I start Clojure. With these in hand, we need a function which will get us the atoms in successive shells around a specified atom. In other words, atoms in the first shell are those connected directly to the specified atom. Atoms in the second shell lie two bonds away and so on. This is achieved by the following function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
(defn get-shells
  "Get the atoms in successive shells surrounding a specified atom.
   The first shell equals the immediate neighbors, the second shell
   the neighbors at bond length 2 and so on. The return value is a list
   in which each element is a list of atoms. The first element of the
   top level list is the n'th shell. Note that the return value does not
   contain the starting atom (i.e., zeroth shell)"

  [mol start depth]
  (loop [shells (list (list start))
        visited (list)]
    (if (= (count shells) (+ 1 depth))
      (drop-last shells)
      (recur (conj shells
           (filter #(not (has visited %))
               (flatten (map #(get-connected-atoms-list mol %) (first shells)))))
         (set (flatten (list visited shells)))))))

Here, depth indicates how many shells we want to go out to. The return value is a list of lists, each sublist representing the atoms in a shell. The first element corresponds to the n’th shell. Note that the zeroth shell (i.e., the start atom) is not contained in the return value.

Before applying this function to all the atoms, we need to consider how we’re going to represent each shell. For example ECFP’s consider various atom properties (degree, H-bonding ability etc). But a very simple procedure is to simply replace each atom in a shell with its atom type name, sort the names and join them into a single string. The end result would be, for a single atom, a list of strings corresponding to the shells around that atom.

First, we have a simple function to convert a list of atoms into a list of CDK atom type names

1
2
3
(defn atype [atoms]
  (map #(. % getAtomTypeName)
       atoms))

Using this, we can then use it to convert the lists of “atom shells” around a given atom to a list of atom type strings

1
2
3
4
5
(defn make-atom-fp
  [mol atom depth]
  (map #(str-join "-" (sort %))
       (map atype
           (get-shells mol atom depth))))

The return value is still a list of strings representing each shell around the specified atom. In practice this list of strings would likely be reduced to a single string or number. The last thing to do is to apply the function to all the atoms in a molecule (using a depth of 2 for example)

1
(map #(make-atom-fp mol % 2) (get-atom-list mol))

An example of using these functions is:

1
2
3
4
user=> (def mol (getmol "C(O)(F)CN"))
#'user/mol
user=> (map #(make-atom-fp mol % 2) (get-atom-list mol))
(("N.sp3" "C.sp3-F-O.sp3") ("C.sp3-F" "C.sp3") ("C.sp3-O.sp3" "C.sp3") ("F-O.sp3" "C.sp3-N.sp3") ("C.sp3" "C.sp3"))

Alternatively, we could reduce this to a list of strings, one for each atom, by simply joining the strings for each shell of a given atom

1
2
user=> (map #(str-join "-" %) (map #(make-atom-fp mol % 2) (get-atom-list mol)))
("N.sp3-C.sp3-F-O.sp3" "C.sp3-F-C.sp3" "C.sp3-O.sp3-C.sp3" "F-O.sp3-C.sp3-N.sp3" "C.sp3-C.sp3")

This is a very simple example, and as I’m still wrapping my head around Lisp, it took some time to write. But once you get the hang of loop and recur, things start flowing!

Written by Rajarshi Guha

March 8th, 2009 at 8:54 pm