So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘similarity’ tag

Slides from a Guest Lecture at Drexel University

without comments

On Thursay I joined Antony Williams as a guest lecturer in Jean Claude-Bradleysclass on chemical information retrieval at Drexel University. Using a combination of WebEx and Skype, we were able to give our presentations – seamlessly joining three different locations. Technology is great! Tony gave an excellent talk on citizen science and ChemSpider and I spoke about similarity and searching. Jean Claude has also put up an audio version.

Written by Rajarshi Guha

December 5th, 2009 at 11:08 pm

Brute Force – Inelegant, But Sometimes Useful

with one comment

A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned version (~ 3M rows in 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.

Andrew Dalke posted a comment:

I’ve wondered about this quote from the ANN page at .

Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”

Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.

Read the rest of this entry »

Written by Rajarshi Guha

November 20th, 2008 at 5:42 pm

Conformational Envelopes

without comments

Joe Leonard posted a question on the CCL mailing list today regarding “conformation envelopes”. More specifically, he asked

Has there been work on creating visualizations of “conformer envelopes”, graphical representations of the conformational space occupied (or available) to molecules. Particularly when such visualizations are used to (quickly/visually) compare whether 2 molecules can adopt the same shape – or if there are shapes of one that can’t be adopted by another.

A while back when I was investigating the use of the Ballester & Graham-Richards shape descriptors for 3D similarity searching. It turns out they perform quite poorly in enrichment benchmarks (which I’ll describe in a future post). At that time I was thinking of how Pub3D could scale to a multi-conformer version and I realized that the shape descriptors would allow me to easily visualize the “shape space” of a set of compounds. When these compounds are conformers for a molecule, one effectively gets a conformational envelope.

Read the rest of this entry »

Written by Rajarshi Guha

November 8th, 2008 at 10:49 pm

Do the CDK Fingerprints Work?

with 12 comments

In a previous post, I dicussed virtual screening benchmarks and some new public datasets for this purpose. I recently improved the performance of the CDK hashed fingerprints and the next question that arose is whether the CDK fingerprints are any good. With these new datasets, I decided to quantitatively measure how the CDK fingerprints compare to some other well known fingerprints.

Update – there was a small bug in the calculations used to generate the enrichment curves in this post. The bug is now fixed. The conclusions don’t change in a significant way. To get the latest (and more) results you should take a look here.

Read the rest of this entry »

Written by Rajarshi Guha

October 11th, 2008 at 5:47 am

Working With Fingerprints in R (can’t beat C!)

without comments

Since I do a lot of cheminformatics work in R, I’ve created various functions and packages that make life easier for me as do my modeling and analysis. Most of them are for private consumption.  However, I’ve released a few of them to CRAN since they seem to be generally useful.

One of them is the fingerprint package (version 2.9 was just uploaded to CRAN) , that is designed to read and manipulate fingerprint data generated from various cheminformatics toolkits or packages. Right now it supports output from the CDK, BCI and MOE. Fingerprints are represented using S4 classes. This allows me to override the R logical operators, so that one can do things like compute the logical OR of two fingerprints.

Read the rest of this entry »

Written by Rajarshi Guha

October 11th, 2008 at 12:14 am