So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘R’ tag

Inserting 2D Depictions into R Plots

without comments

Recent versions of rcdk allow you to insert images of chemical structures into R plots, via the view.image.2d and rasterImage functions. One problem with the latter function is that the 2D structure image must be located in plot units, rather than pixel units. Paul Murrell suggested an easy way to insert the raster image into the plot region, maintaining the  native resolution of the image:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(rcdk)
m <- parse.smiles("O=C(C1=CC=CC=C1)C1=CC=CC=C1")[[1]]
img <- view.image.2d(m, 200,200)
plot(10:1, pch=19)

## Position the depiction at the lower left corner
dpi <- (par("cra")/par("cin"))[1]
usr <- par("usr")
xl <- usr[1]
yb <- usr[3]
xr <- xl + xinch(200/dpi)
yt <- yb + yinch(200/dpi)

rasterImage(img, xl,yb, xr,yt)

Written by Rajarshi Guha

November 20th, 2010 at 5:09 pm

Posted in cheminformatics,software

Tagged with , ,

Updates to R Packages

without comments

I’ve uploaded a new version of fingerprint (v 3.4) which now supports feature fingerprints – fingerprints that are represented as variable length vectors of numbers or strings. An example would be circular fingerprints. Now, when reading fingerprints you have to indicate whether you’re loading binary fingerprints or not (via the binary argument in fp.read). A new line parser function (ecfp.lf) is provided to load these types of files, though it’s trivial to write your own. Similarity can be evaluated between feature fingerprints in the usual manner, but the metrics are restricted to Tanimoto and Dice. A function is also available to convert a collection of feature fingerprints into a set of fixed length binary fingerprints (featvec.to.binaryfp) as described here.

New versions of rcdk (v 3.0.4) and rcdklibs (v 1.3.6.3) have also been uploaded to CRAN. These releases are based on todays CDK 1.4.x branch and resolve a number of bugs and add some new features

  • Correct formula generation
  • Correct handling of SD tags whose values are just white space
  • Proper generation of Murcko frameworks when molecule objects are requested
  • 3 new descriptors – FMF, acidic group count, basic group count

Written by Rajarshi Guha

October 22nd, 2010 at 1:58 am

Posted in cheminformatics

Tagged with , ,

Working with Sequences in R

with one comment

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. So, our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could a write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

1
2
3
4
seqs <- get_sequences_from_db()
seqs <- sapply(seqs, function(x) {
  as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and it’s subclasses, DNAStringSet and RNAStringSet. Using this method got me the siRNA sequences in less than a second.

1
2
seqs <- get_sequences_from_db()
seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))

A slightly contrived example shows the performance improvement

1
2
3
4
5
x <- sapply(1:1000, function(x) {
    paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) )))

Ideally, my descriptor code would also operate directly on a RNAString object, rather than requiring a character object

Written by Rajarshi Guha

October 20th, 2010 at 10:11 pm

Posted in bioinformatics,software

Tagged with , ,

New Versions of rcdk and rcdklibs

with 2 comments

I’ve put released an update to rcdk and rcdklibs on CRAN – right now source packages are available, but binary ones should show up soon. Both packages should be updated together. These packages integrate the CDK into the R environment and simplifies a number of cheminformatics tasks. These versions used CDK 1.3.6 and JCP 16, so we now get access to SMSD and a few new descriptors.. In addition a some new methods have been included

  • cdk.version to get the version of the CDK being used by the package
  • is.subgraph uses SMSD to identify substructures. Similar to the pre-exisiting matches method, but much faster in general (though you cannot specify SMARTS)
  • get.mcs again, wraps SMSD and allows one to retrieve the MCS (as a molecule object or as atom indices) for a pair of molecules. Once again, should be pretty fast

Written by Rajarshi Guha

September 25th, 2010 at 10:05 pm

Posted in cheminformatics,software

Tagged with , ,

A Quick Look at the GSK Malaria Dataset

with 5 comments

A few days ago, GSK released an approximately 13,000 member compound library (using the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL and a paper is available, that describes the screening project and results. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.

There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.

I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-member rings. Another region of difference occurs around bit position 300 and then around positions 350-375.

The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.

Written by Rajarshi Guha

May 23rd, 2010 at 1:02 pm