## Archive for the ‘R’ tag

## Similarity Matrices in Parallel

Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Now, pairwise similarity matrix calculations (whether they’re for molecules or sequences or anything else) are by definition quadratic in nature. So performing these calculations for large collections aren’t always feasible – in many cases, it’s worthwhile to rethink the problem.

But for those situations where you do need to evaluate it, a simple way to parallelize the calculation is to evaluate the similarity of each molecule with all the rest in parallel. This means each process/thread must have access to the entire set of fingerprints. So again, for very large collections, this is not always practical. However, for small collections parallel evaluation can lead to speed ups.

The fingerprint package provides a method to directly get the similarity matrix for a set of fingerprints, but this is implemented in interpreted R so is not very fast. Given a list of fingerprints, a manual evaluation of the similarity matrix can be done using nested lapply’s:

1 2 3 4 | library(fingerprint) sims <- lapply(fps, function(x) { unlist(lapply(fps, function(y) distance(x,y))) }) |

For 1012 fingerprints, this takes 286s on my Macbook Pro (4GB, 2.4 GHz). Using snow, we can convert this to a parallel version, which takes 172s on two cores:

1 2 3 4 5 6 7 8 | library(fingerprint) library(snow) cl <- makeCluster(4, type = "SOCK") clusterEvalQ(cl, library(fingerprint)) clusterExport(cl, "fps") sim <- parLapply(cl, fps, function(x) { unlist(lapply(fps, function(y) distance(x,y))) }) |

## Inserting 2D Depictions into R Plots

Recent versions of rcdk allow you to insert images of chemical structures into R plots, via the view.image.2d and rasterImage functions. One problem with the latter function is that the 2D structure image must be located in plot units, rather than pixel units. Paul Murrell suggested an easy way to insert the raster image into the plot region, maintaining the native resolution of the image:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 | library(rcdk) m <- parse.smiles("O=C(C1=CC=CC=C1)C1=CC=CC=C1")[[1]] img <- view.image.2d(m, 200,200) plot(10:1, pch=19) ## Position the depiction at the lower left corner dpi <- (par("cra")/par("cin"))[1] usr <- par("usr") xl <- usr[1] yb <- usr[3] xr <- xl + xinch(200/dpi) yt <- yb + yinch(200/dpi) rasterImage(img, xl,yb, xr,yt) |

## Updates to R Packages

I’ve uploaded a new version of fingerprint (v 3.4) which now supports feature fingerprints – fingerprints that are represented as variable length vectors of numbers or strings. An example would be circular fingerprints. Now, when reading fingerprints you have to indicate whether you’re loading binary fingerprints or not (via the **binary** argument in **fp.read)**. A new line parser function (**ecfp.lf**) is provided to load these types of files, though it’s trivial to write your own. Similarity can be evaluated between feature fingerprints in the usual manner, but the metrics are restricted to Tanimoto and Dice. A function is also available to convert a collection of feature fingerprints into a set of fixed length binary fingerprints (**featvec.to.binaryfp**) as described here.

New versions of rcdk (v 3.0.4) and rcdklibs (v 1.3.6.3) have also been uploaded to CRAN. These releases are based on todays CDK 1.4.x branch and resolve a number of bugs and add some new features

- Correct formula generation
- Correct handling of SD tags whose values are just white space
- Proper generation of Murcko frameworks when molecule objects are requested
- 3 new descriptors – FMF, acidic group count, basic group count

## Working with Sequences in R

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. So, our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could a write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

1 2 3 4 | seqs <- get_sequences_from_db() seqs <- sapply(seqs, function(x) { as.character(complement(RNAString(DNAString(x)))) }) |

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and it’s subclasses, DNAStringSet and RNAStringSet. Using this method got me the siRNA sequences in less than a second.

1 2 | seqs <- get_sequences_from_db() seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs)))) |

A slightly contrived example shows the performance improvement

1 2 3 4 5 | x <- sapply(1:1000, function(x) { paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='') }) system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x))))) system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) ))) |

Ideally, my descriptor code would also operate directly on a RNAString object, rather than requiring a character object

## New Versions of rcdk and rcdklibs

I’ve put released an update to rcdk and rcdklibs on CRAN – right now source packages are available, but binary ones should show up soon. Both packages should be updated together. These packages integrate the CDK into the R environment and simplifies a number of cheminformatics tasks. These versions used CDK 1.3.6 and JCP 16, so we now get access to SMSD and a few new descriptors.. In addition a some new methods have been included

**cdk.version**to get the version of the CDK being used by the package**is.subgraph**uses SMSD to identify substructures. Similar to the pre-exisiting**matches**method, but much faster in general (though you cannot specify SMARTS)**get.mcs**again, wraps SMSD and allows one to retrieve the MCS (as a molecule object or as atom indices) for a pair of molecules. Once again, should be pretty fast