So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘c++’ tag

fingerprint 3.5.2 released

with 2 comments

Comparison of nested loop performance in R and C for Tanimoto similarity matrix calculation.


Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.

His patch led to a 10-fold improvement in running time. However, his code involved nested for loops in R. This is a well-known bottleneck, and most idiomatic R code replaces for loops with a member of the sapply/lapply/tapply family. In this case, however, it was easier to write a small piece of C code to perform the loops, resulting in a 4- to 6-fold improvement over Abhik’s observed running times (see the figure summarizing Tanimoto similarity matrix calculation for 1024-bit fingerprints, with 256 bits randomly selected to be 1). As always, the latest code is available on GitHub.
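To give a flavor of what moving the nested loops into C looks like, here is a minimal sketch. It is not the code that ships in the fingerprint package: the function names, the packed 64-bit word layout and the choice to return 0 when both fingerprints are empty are all assumptions made for illustration. A 1024-bit fingerprint would occupy 16 words in this layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 64-bit word (Kernighan's method, portable C). */
static unsigned popcount64(uint64_t x) {
    unsigned count = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Tanimoto similarity of two bit fingerprints: |A & B| / |A | B|.
 * Assumption: two all-zero fingerprints are given a similarity of 0. */
static double tanimoto(const uint64_t *a, const uint64_t *b, size_t nwords) {
    unsigned both = 0, either = 0;
    for (size_t w = 0; w < nwords; w++) {
        both   += popcount64(a[w] & b[w]);
        either += popcount64(a[w] | b[w]);
    }
    return either == 0 ? 0.0 : (double)both / (double)either;
}

/* Fill the n x n similarity matrix (row-major) with the nested loops in C.
 * fps holds n fingerprints back to back, nwords words each (hypothetical
 * layout). Only the upper triangle is computed; the lower one is mirrored. */
void tanimoto_matrix(const uint64_t *fps, size_t n, size_t nwords, double *sim) {
    assert(fps != NULL && sim != NULL && nwords > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i; j < n; j++) {
            double s = tanimoto(fps + i * nwords, fps + j * nwords, nwords);
            sim[i * n + j] = s;
            sim[j * n + i] = s;
        }
    }
}
```

Calling this from R would still need a thin wrapper (for example via the .C or .Call interfaces) to pass the fingerprint bits in and get the matrix back; that part is left out here.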

Written by Rajarshi Guha

October 27th, 2013 at 10:44 pm

Posted in cheminformatics,software


Working With Fingerprints in R (can’t beat C!)

without comments

Since I do a lot of cheminformatics work in R, I’ve created various functions and packages that make life easier for me as I do my modeling and analysis. Most of them are for private consumption. However, I’ve released a few of them to CRAN since they seem to be generally useful.

One of them is the fingerprint package (version 2.9 was just uploaded to CRAN), which is designed to read and manipulate fingerprint data generated by various cheminformatics toolkits or packages. Right now it supports output from the CDK, BCI and MOE. Fingerprints are represented using S4 classes. This allows me to override the R logical operators, so that one can do things like compute the logical OR of two fingerprints.

Read the rest of this entry »

Written by Rajarshi Guha

October 11th, 2008 at 12:14 am

CDL – A Cheminformatics Toolkit

with 5 comments

The Chemical Descriptors Library (CDL) has been around for a while, but doesn’t seem to have gotten much publicity. A paper describing the design and performance of the library just came out today. While the name suggests a library of descriptors, it’s actually a general C++ library for cheminformatics. The library appears to use the molecular graph as its core concept and uses the Boost Graph Library (BGL) to represent and manipulate molecular graphs. Some features include substructure searching using SMARTS, fingerprints, descriptors (CATS, a bunch of topological descriptors, etc.) and file format reading (SMILES and SDF as far as I can see).

It seems nice and is available under the Boost Software License. While it does a lot of the basic operations, it doesn’t appear as comprehensive as, say, OpenBabel or RDKit. However, it’s good to see the cheminformatics toolkit ecosystem growing.

An aside – I haven’t really done much C++ coding and what little I do is basically ‘C in C++’. But how do people get their heads around C++ templates? I tend to get a headache when trying to examine one. And I thought that writing Java was tedious – C++ with templates takes the cake!

Update - Their Sourceforge project page is here but I can’t seem to find a download link. A software paper with no software!

Written by Rajarshi Guha

September 20th, 2008 at 2:20 pm

Posted in cheminformatics,software
