So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘fingerprint’ tag

New Version of fingerprint

with 3 comments

I’ve submitted version 3.4.3 of the fingerprint package to CRAN, so it should be available in a day or two. It’s an R package that lets you read in (chemical structure) fingerprint data from a variety of sources (CDK, MOE, BCI etc) and perform a variety of operations (bitwise, similarity, etc.) and visualizations on them.

The two main additions to this version are

  • Read support for the new FPS fingerprint format described by Andrew Dalke at the chemfp project. Note, it currently discards some of header information
  • The fingerprint class now has a field, misc, (a list) that allows one to read in extra, arbitrary data that might be provided along with a fingerprint. Exactly what gets stored in this field depends on the line function used to read in the fingerprint data. Currently only the FPS parser returns extra data (when available) in this field.

As always, you can get the package source directly from the Github repository.

Written by Rajarshi Guha

June 3rd, 2011 at 12:13 am

Posted in cheminformatics

Tagged with , ,

Caching SMARTS Queries

with 3 comments

Andrew Dalke recently published a detailed write up on his implementation of the Pubchem fingerprints and provided a pretty thorough comparison with the CDK implementation. He pointed out a number of bugs in the CDK version; but he also noted that performance could be improved by caching parsed SMARTS queries – which are used extensively in this fingerprinter. So I wanted to see whether caching really helps.

The CDK SMARTS parser is a JJTree based implementation and my anecdotal evidence suggested that SMARTS parsing was not a real bottleneck. But of course, nothing beats measurements. So I modified the SMARTSQueryTool class to cache parsed SMARTS queries using a simple LRU cache (patch):

1
2
3
4
5
6
final int MAX_ENTRIES = 20;
Map<String, QueryAtomContainer> cache = new LinkedHashMap<String, QueryAtomContainer>(MAX_ENTRIES + 1, .75F, true) {
    public boolean removeEldestEntry(Map.Entry eldest) {
        return size() > MAX_ENTRIES;
    }
};

The map is keyed on the SMARTS string. Then, when we attempt to parse a new SMARTS query, we check if it’s in the cache and if not, parse it and place it in the cache.

Now, the thing about this caching mechanism is that after 20 SMARTS queries have been cached, the least recently used one is replaced with a new one. As a result, if we perform matching with 40 unique SMARTS (in sequence) only the last 20 get cached, for a given molecule. But when we go to do it on a new molecule, the first 20 are not in the cache and hence we shouldn’t really benefit from the caching scheme. In general, if the fingerprinter (or any caller of the parser) will perform N unique SMARTS queries for a single molecule, the cache size must be at least N, for the cache to be used for subsequent molecules.

I implemented a quick test harness, reading in 100 SMILES and then generating Pubchem fingerprints for each molecule. The fingerprint generation was repeated 5 times and the time reported for each round. The box plot shows the distribution of the timings. Now, the CDK implementation has 621 SMARTS patterns – as you can see, we only get a benefit from the caching when the cache size is 700. In fact, cache sizes below that lead to a performance hit  - I assume due to time required to query the map.

While the performance improvement is not dramatic it is close to 10% compared to no-caching at all. In actuality, the major bottleneck in the SMARTS parser is the actual graph isomorphism step (which we hope to drastically improve by using the SMSD code). Maybe then, SMARTS parsing will take a bigger fraction of the time. Also keep in mind that due to the heavyweight nature of CDK molecule objects, very large caches could be a strain on memory. But to check that out, I should use a profiler.

Written by Rajarshi Guha

January 23rd, 2011 at 12:56 am

Similarity Matrices in Parallel

without comments

Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Now, pairwise similarity matrix calculations (whether they’re for molecules or sequences or anything else) are by definition quadratic in nature. So performing these calculations for large collections aren’t always feasible – in many cases, it’s worthwhile to rethink the problem.

But for those situations where you do need to evaluate it, a simple way to parallelize the calculation is to evaluate the similarity of each molecule with all the rest in parallel. This means each process/thread must have access to the entire set of fingerprints. So again, for very large collections, this is not always practical. However, for small collections parallel evaluation can lead to speed ups.

The fingerprint package provides a method to directly get the similarity matrix for a set of fingerprints, but this is implemented in interpreted R so is not very fast. Given a list of fingerprints, a manual evaluation of the similarity matrix can be done using nested lapply’s:

1
2
3
4
library(fingerprint)
sims <- lapply(fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

For 1012 fingerprints, this takes 286s on my Macbook Pro (4GB, 2.4 GHz). Using snow, we can convert this to a parallel version, which takes 172s on two cores:

1
2
3
4
5
6
7
8
library(fingerprint)
library(snow)
cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(fingerprint))
clusterExport(cl, "fps")
sim <- parLapply(cl, fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

Written by Rajarshi Guha

December 2nd, 2010 at 1:03 am

Updates to R Packages

without comments

I’ve uploaded a new version of fingerprint (v 3.4) which now supports feature fingerprints – fingerprints that are represented as variable length vectors of numbers or strings. An example would be circular fingerprints. Now, when reading fingerprints you have to indicate whether you’re loading binary fingerprints or not (via the binary argument in fp.read). A new line parser function (ecfp.lf) is provided to load these types of files, though it’s trivial to write your own. Similarity can be evaluated between feature fingerprints in the usual manner, but the metrics are restricted to Tanimoto and Dice. A function is also available to convert a collection of feature fingerprints into a set of fixed length binary fingerprints (featvec.to.binaryfp) as described here.

New versions of rcdk (v 3.0.4) and rcdklibs (v 1.3.6.3) have also been uploaded to CRAN. These releases are based on todays CDK 1.4.x branch and resolve a number of bugs and add some new features

  • Correct formula generation
  • Correct handling of SD tags whose values are just white space
  • Proper generation of Murcko frameworks when molecule objects are requested
  • 3 new descriptors – FMF, acidic group count, basic group count

Written by Rajarshi Guha

October 22nd, 2010 at 1:58 am

Posted in cheminformatics

Tagged with , ,

A Comment on Fingerprint Performance

without comments

In a comment to my previous post on bit collisions in hashed fingerprints, Asad reported on some interesting points which would be useful to have up here:

Very interesting topic. I have faced these challenges while working with fingerprints and here are few observations from my end. By the way I agree that mathematically the best bet is ~ 13%.

1) The hashed FP (CDK) is good enough to separate patterns which are not common but on a large dataset (in my case 10000+ mols), the performance drops drastically. Top 1% hits were good but then rest of the started to loose specificity (esp when Tanimoto score was around 0.77).

2) First I thought it was an artifact of the Tanimoto score… but I wasn’t convinced spl. in cases where we had rings (close vs open). I ended up writing a new FP based on the pubchem patterns as coded in the CDK and added few more patterns to resize it to 1024 from 881. Well! It’s works like magic and I could find much more serialised hits than before. I think the extensions of the fingerprint which I made based on the patterns in my db also helped.

At the end of the day, I believe that all these searches are heuristic and hashed FP is faster to generate but prone to bit clashes where as SMARTS based FPs are slower to generate (as u spend time in MCS) in matching patterns but they are more sensitive and specific as you can trace the patterns (u get what u see) as the patterns and bitset relationship is know and static

Written by Rajarshi Guha

October 9th, 2010 at 3:23 am

Posted in cheminformatics,software

Tagged with , ,