BAZOO

So much to do, so little time

Trying to squeeze sense out of chemical data

Working With Fingerprints in R (can’t beat C!)

without comments

Since I do a lot of cheminformatics work in R, I’ve created various functions and packages that make life easier for me as do my modeling and analysis. Most of them are for private consumption.  However, I’ve released a few of them to CRAN since they seem to be generally useful.

One of them is the fingerprint package (version 2.9 was just uploaded to CRAN) , that is designed to read and manipulate fingerprint data generated from various cheminformatics toolkits or packages. Right now it supports output from the CDK, BCI and MOE. Fingerprints are represented using S4 classes. This allows me to override the R logical operators, so that one can do things like compute the logical OR of two fingerprints.

An example of its usage is

1
2
3
4
fps <- fp.read('cdk-fps.txt', header=TRUE, lf = cdk.lf, size=1024)
newfp <- fps[[1]] | fps[[2]]
newfp <- fps[[1]] & fps[[2]]
distance(fps[[1]], fp[[2]], method='tanimoto')

The first line will create a list of fingerprint objects and we can then manipulate individual objects. Probably the most common task is calculating similarities between fingerprints. The package supports a wide variety of similarity and dissimilarity metrics include Tanimoto, Modified Tanimoto, Jaccard, Dice and so on, which can be evaluated using the “distance” function.

One of the problems with the old version of this package was that similarity calculations were quite slow. This was because the class represents the fingerprint as a vector of integers, where each integer is the position in the bit string set to 1. This representation was manipulated using R to get the various components of the similarity metrics.

While R is great for data analysis and algorithm prototyping, it’s well known that for code that needs to be fast, you’re going to have to go down to C or Fortran. Luckily, it’s very easy to call C code from R. So a quick rewrite of the Tanimoto and Euclidean metrics in C code resulted in speedups ranging from 40X to 50X, as shown in the benchmarks below. The numbers are the elapsed times for 5000 Tanimoto similarity calculations on two fingerprints, run on a MacbookPro (2GHz, 1GB RAM) with R 2.7.2

As expected larger fingerprints require more time. However, the performance doesn’t degrade too much. This version was just uploaded to CRAN, so should be available in a day or two.

FP Size (bits) Old Elapsed Time (sec) New Elapsed Time (sec)
1024 50.63 1.28
166 48.92 1.03
79 48.62 0.98

While this is a nice speed-up it’s still not as good as it could be. Andrew Dalke has some excellent essays on fast similarity calculations with binary fingerprints. While it’s true that the C code could do with these improvements, the bottleneck is the conversion of the S4 fingerprint class to a structure palatable for the C code and not the actual calculation in C.

Currently, before  evaluating the similarity via the C functions, the package will convert the fingerprint objects to numeric vectors of 1′s and 0′s of length N, where N is the length of the fingerprint. So each time we evaluate the similarity of a pair of fingerprints, we’re creating two new numeric vectors. These are then sent to the C function as double*. As a result, the C code must iterate over the elements of an array, rather than perform bitwise operations.  (One advantage of the current approach, is that arbitrary length fingerprints require no extra work).

However, the fact is, the data manipulation on the R side does take its toll. In fact, some quick testing showed me that by avoiding dealing with the S4 class accessors and directly sending a numeric vector can lead to a 200X speed up for pairwise similarity calculations. One way to avoid this problem would be to create the explicit numeric vector (containing the 1′s and 0′s) representation at object instantiation time. As a result every fingerprint object would carry the representation and we wouldn’t have to create it every time we wanted to evaluate the similarity.  The downside is that, the fingerprint objects would be larger. A quick look at some fingerprints that were lying around indicate that on average, a single fingerprint object is using 884, 1192 and 2321 bytes for a 79 bit, 166 bit and 1024 bit fingerprint. (The small sizes are due to the fact that most of the bits in the fingerprints are 0). By adding a numeric vector to hold the full, explicit fingerprint would add 656, 1352 and 8216 bytes  respectively, to each of the sizes noted previously. So it’s a space-time tradeoff. Whether it’s worth it will have to wait for some benchmarks.

In the end, if you need speed, it’s good to know C.

(And no, I’m not willing to go down to assembler!)

Written by Rajarshi Guha

October 11th, 2008 at 12:14 am

Leave a Reply