So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for May, 2010

Some More Comparisons with the GSK Dataset

without comments

My previous post did a quick comparison of the GSK anti-malarial screening dataset with a virtual library of Ugi products. That comparison was based on the PubChem fingerprints and indicated a broad degree of overlap. I was also interested in looking at the overlap in other feature spaces. The simplest way to do this is to evaluate a set of descriptors and then perform a principal components analysis. We can then plot the first two principal components to get an idea of the distribution of the compounds in the defined space.
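As a rough illustration of the procedure just described, here is a minimal sketch in base R. The two descriptor tables are randomly generated stand-ins (the column names `nRotB`, `MW` and `XLogP`, the library sizes and all values are assumptions made for this example, not the GSK or Ugi data), and `prcomp` is just one reasonable way to do the PCA.

```r
# Synthetic stand-ins for two descriptor tables (rows = compounds).
# Column names and values are illustrative only, not taken from either dataset.
set.seed(42)
make_descriptors <- function(n, mw_mean) {
  data.frame(
    nRotB = rpois(n, lambda = 5),
    MW    = rnorm(n, mean = mw_mean, sd = 60),
    XLogP = rnorm(n, mean = 3, sd = 1.2)
  )
}
lib_a <- make_descriptors(200, mw_mean = 420)
lib_b <- make_descriptors(300, mw_mean = 380)

# Pool both libraries so they are projected into the same component space
combined <- rbind(lib_a, lib_b)
source_lib <- factor(rep(c("A", "B"), times = c(nrow(lib_a), nrow(lib_b))))

# Center and scale, since the descriptors are on very different scales
pca <- prcomp(combined, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Fraction of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# First two principal components, colored by source library
plot(scores, col = c("steelblue", "tomato")[source_lib], pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(source_lib),
       col = c("steelblue", "tomato"), pch = 19)
```

Pooling the two libraries before the PCA matters: running a separate PCA per library would give two different sets of axes, and the resulting plots could not be overlaid.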

I evaluated a number of descriptors using the CDK. In a physicochemical space represented by the number of rotatable bonds, molecular weight and XlogP values, a plot of the first two principal components looks as shown on the right. Given the large number of points, the plot is more of a blob, but it does highlight the fact that there is a good degree of overlap between the two datasets. On going to a BCUT space on the left, we get a different picture, stressing the greater diversity of the GSK dataset. Of course, these are arbitrary descriptor spaces and not necessarily meaningful. One would probably choose a descriptor space based on the problem at hand (and also the CDK XlogP implementation probably needs some work).

I was also interested in the promiscuity of the compounds in the GSK dataset. Promiscuity is the phenomenon where a molecule shows activity in multiple assays. Promiscuous activity could indicate that the compound is truly active in all or most of the assays (i.e., hitting multiple distinct targets), but could also indicate that the activity is artifactual (such as if it were an aggregator or fluorescent compound).

This analysis is performed by looking for those GSK molecules that are in the NCGC collection (272 exact matches) and checking to see how many NCGC assays they are tested in and whether they were active or not. Rather than look at all assays in the NCGC collection, I consider a subset of approximately 1300 assays curated by a colleague. Ideally, a compound will be active in only one (or a few) of the assays it is tested in.

For simplicity’s sake, I just plot the number of assays a compound is tested in versus the number of them that it is active in. The plot is colored by the activity (pXC50 value in the GSK SD file) so that more potent molecules are lighter. While the bulk of these molecules do not show significant promiscuous activity, a few of them do lie at the upper range. I’ve annotated four and their structures are shown below. Compound 530674 appears to be quite promiscuous given that it is active in 46 out of the 84 assays it’s been tested in at the NCGC. On the other hand, 22942 is tested in 232 assays but is active in only 78 of them. This could be considered a low ratio, and isoquinolines have been noted to be non-promiscuous. (Both of these target kinases, as noted in Gamo et al.)
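To make the ratios concrete, the tested/active counts quoted above can be turned into a simple hit rate. This is only a sketch: the data frame and its column names are made up for the example, and the hit rate is just the plain fraction of tested assays in which a compound was active.

```r
# Tested/active counts for the two compounds quoted in the text
assays <- data.frame(
  compound = c("530674", "22942"),
  n_tested = c(84, 232),
  n_active = c(46, 78)
)

# Fraction of tested assays in which each compound was active
assays$hit_rate <- assays$n_active / assays$n_tested
print(assays)
```

This gives roughly 0.55 for 530674 and 0.34 for 22942, which is why the second compound looks less promiscuous despite having more active assays in absolute terms.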



Written by Rajarshi Guha

May 24th, 2010 at 2:47 am

A Quick Look at the GSK Malaria Dataset

with 5 comments

A few days ago, GSK released a compound library of approximately 13,000 members (under the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL, and a paper describing the screening project and results is available. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.

There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.
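The idea of a bit spectrum is easy to see on a toy example. The sketch below uses base R on a small made-up fingerprint matrix rather than the fingerprint package, and the helper name `bit_spectrum` is hypothetical; it just computes, for each bit position, the fraction of molecules that have that bit set.

```r
# Toy fingerprints: rows are molecules, columns are bit positions (1 = bit set).
# A real run would have 881 columns; these values are invented for illustration.
fps <- rbind(
  c(1, 0, 1, 1, 0),
  c(1, 0, 0, 1, 0),
  c(1, 1, 0, 1, 0),
  c(0, 0, 0, 1, 0)
)

# Bit spectrum: fraction of molecules in which each bit is set
bit_spectrum <- function(fp_matrix) colMeans(fp_matrix)

spec <- bit_spectrum(fps)
print(spec)

# One vertical line per bit position, height = normalized frequency
plot(spec, type = "h", xlab = "Bit position", ylab = "Normalized frequency")
```

Because each value is normalized by the number of molecules, spectra from libraries of very different sizes can be compared directly.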

I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-member rings. Another region of difference occurs around bit position 300 and then around positions 350-375.

The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.
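A difference spectrum of this kind could be sketched as follows. The two spectra here are invented six-bit examples (not the real 881-bit spectra), and the sign convention follows the text: positive values mean a bit is more frequent in the Ugi library.

```r
# Invented bit spectra for two libraries over six bit positions
spec_ugi <- c(0.90, 0.10, 0.40, 0.05, 0.60, 0.30)
spec_gsk <- c(0.85, 0.45, 0.35, 0.50, 0.20, 0.30)

# Positive values: bit more frequent in the Ugi library; negative: in GSK
diff_spec <- spec_ugi - spec_gsk

# Rank bit positions by absolute difference, largest first
ranked <- order(abs(diff_spec), decreasing = TRUE)
top <- data.frame(bit = ranked, difference = diff_spec[ranked])
print(head(top, 3))
```

Sorting by absolute difference surfaces the bit positions worth looking up in the fingerprint definition, whichever library they favor.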

Written by Rajarshi Guha

May 23rd, 2010 at 1:02 pm

New Version of rcdk

without comments

Based on feedback from the recent R workshop at the EBI, I’ve updated the rcdk package to include more methods operating on atoms, along with a modification to parse.smiles that lets it handle a vector of SMILES strings, which makes it more R-like (thanks to Tobias Verbeke for the patch). In addition, one can now load very large SMILES or SDF files using the iterating readers from the CDK. This feature makes use of the iterators package and lets you write code such as

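library(rcdk)

# Iterate over the file one molecule at a time instead of loading it whole
iter <- iload.molecules('big.smi', type='smi')
while(hasNext(iter)) {
  mol <- nextElem(iter)
  print(get.property(mol, "cdk:Title"))
}
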
As a result, only one molecule is loaded at a time, allowing one to process arbitrarily large files. Version 2.9.23 has been uploaded to CRAN and should be available in a day or two.

Written by Rajarshi Guha

May 21st, 2010 at 11:43 pm

Posted in software


Spreading the Word About R & Cheminformatics

with 2 comments

These last few days I’ve been in the UK for an EBI workshop on cheminformatics in R. It was a two-day workshop: the first day focused on general cheminformatics in R using the rcdk and rpubchem packages, and the second day, run by Steffen Neumann and Paul Benton, focused on mass spectrometry in R using XCMS and Rdisop. It was an excellent workshop with participation from industry and academia, and with skill levels ranging from new R users to experts and from people with minimal cheminformatics backgrounds to full-time cheminformaticians.

While I think my exercises might have been a little too difficult, I think we were able to cover a variety of topics, ranging from details on how to do specific cheminformatics operations in R to more application-oriented tasks such as fingerprint-based analysis and benchmarking virtual screening methods. The slides from the workshop are available here – it’s a pretty big slide deck and covers some introductory R (there are some mistakes in that section which I will update in the coming days), an overview of the CDK and then sections on usage and applications of the rcdk and rpubchem packages.

It certainly helped that I had a very friendly audience! During the course of the workshop I also learned a few things about R (thanks to Tobias Verbeke and Steffen). Given that about 40 people or so were exposed to the rcdk package, my (known) user base should hopefully increase :) It was nice to get a patch from Tobias during the workshop, which will be incorporated once I’m back home.

It was also great to meet a number of people with whom I’d only had email or FriendFeed exchanges in the past – including Chris Swain, Mark Rijnbeek, Duncan Hull, Nico Adams (though I didn’t realize it was him when I was speaking to him – sorry Nico!), Duan Lian and Syed Asad Rahman. I also got to briefly meet some of the ChEMBL folks (John and Patricia). Monday night we had a lovely workshop dinner at The Cricketer (Clavering).

Many thanks to Gabriella Rustici and Dominic Clark for organizing this and inviting me to run the first day. The only downside of this trip? It was too short :) It would’ve been great to be able to stay a day or two more to have longer discussions with various groups.

In addition to the workshop, I visited Asad and his family in Cambridge for a fantastic dinner and much useful discussion. He’s done some excellent work on SMSD and showed me some of his recent work on enzyme classification and reaction mappings. I won’t say much more as he’s writing this up, except to say that it was quite impressive and I’m eagerly looking forward to seeing the writeups. Hopefully we’ll be able to do some joint work in the near future. Given the speed up that SMSD provides for graph isomorphism, I’m in the process of updating the CDK SMARTS parser to make use of it rather than the older UIT, which should improve SMARTS matching considerably. Down the road, the pharmacophore matching code will get a similar upgrade.

I was also able to squeeze in a day trip up to Harrogate, where I grew up. It was fun to see familiar streets and places after 23 years or so. It certainly didn’t hurt to also have some pretty amazing traditional English fare (the Yorkshire curd tart at Bettys and the fish ‘n chips at Graveleys were fantastic).

Written by Rajarshi Guha

May 19th, 2010 at 12:21 pm

Posted in software,cheminformatics


2D Depictions in R Plots

with 3 comments

In preparation for the upcoming R workshop at the EBI, I’ve been cleaning up the rcdk package and updating some features. One of the new features is the ability to get a 2D depiction as a raster image. Until now, 2D depictions were drawn in a Swing window – this allowed you to resize the window but not much else. You really couldn’t use it for anything but viewing.

However, R-2.11.0 provides a new function called rasterImage, which overlays a raster image onto a pre-existing plot. It turns out that the png package lets me easily create such a raster image, either from a PNG file or from a vector of bytes. Given a molecule, we can get the byte array of its PNG representation via the view.image.2d function in the latest rcdk. As a result, you can now make a plot and then overlay a 2D depiction within the plot area. For example, to get the picture shown alongside, we could do:

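library(rcdk)

# Parse the structure (rcdk versions whose parse.smiles returns a list need [[1]] here)
m <- parse.smiles("C1CC2CC1C(=O)NC2")
# Render a 200 x 200 depiction as a raster image
img <- view.image.2d(m, 200,200)
plot(1:10, pch=19)
# Overlay the depiction on the region x = 2..6, y = 6..10 of the plot
rasterImage(img, 2,6, 6,10)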

The latest version of rcdk and rpubchem is not on CRAN yet, but you can get source packages for OS X & Linux and binary packages for Windows at Note that the latest version of rcdk requires R-2.11.0 along with rJava, rcdklibs, fingerprint and png as dependencies. If you’re interested in contributing check out the git repository.

Written by Rajarshi Guha

May 3rd, 2010 at 9:22 pm