So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for November, 2008

Brute Force – Inelegant, But Sometimes Useful

with one comment

A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows) to a partitioned version (about 3M rows in each of 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.

Andrew Dalke posted a comment:

I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/:

“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”

Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
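To make the linear search in the comment concrete, here is a minimal sketch of a brute-force nearest-neighbor query over 12-dimensional vectors. It is an illustration only, not the Pub3D code: the function name `brute_force_neighbors` and the random test data are assumptions made for this example.

```python
import heapq
import math
import random


def brute_force_neighbors(query, points, k=5):
    """Return the k points closest to `query` as (distance, index) pairs.

    Every distance is computed; no index structure is involved.
    """
    distances = ((math.dist(query, p), i) for i, p in enumerate(points))
    return heapq.nsmallest(k, distances)


if __name__ == "__main__":
    rng = random.Random(0)
    dims = 12  # dimensionality mentioned in the comment above
    points = [[rng.random() for _ in range(dims)] for _ in range(10_000)]
    query = [rng.random() for _ in range(dims)]
    for dist, idx in brute_force_neighbors(query, points, k=3):
        print(f"row {idx}: distance {dist:.4f}")
```

The cost grows linearly with the number of rows, which is why this is cheap for a few thousand molecules and a more open question for millions.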

Written by Rajarshi Guha

November 20th, 2008 at 5:42 pm

BibDesk and Word

with one comment

Since writing papers is pretty much a way of life for an academic, I like to have tools that let me concentrate on the content, yet make beautiful documents with minimal effort on my part. The solution to this is LaTeX. While it gives me beautifully typeset documents, it doesn’t handle bibliographic data management. That job is left to BibTeX, a widely used plain text format for bibliographies.

Written by Rajarshi Guha

November 20th, 2008 at 6:20 am

Posted in software

Java Port of VFLib Works and it’s Blazing

with 8 comments

Some time back I described how I was porting the VFLib algorithms to Java so that we could use them for substructure search, since the current UniversalIsomorphismTester is generally pretty slow for this task. While I had translated VFLib’s implementation of the Ullmann algorithm and shown that it outperformed the CDK method, it turned out that it didn’t work for certain cases, such as finding CCC in C1CC1. This was due to a different definition of isomorphism that VFLib used. Instead, I tried to convert the VF2 implementation to Java. The motivation was that it does indeed perform substructure matching as it is usually understood in cheminformatics, and it was also reported to be extremely fast. Unfortunately, my translation was buggy and I put it on hold.
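One plausible reading of the CCC in C1CC1 case is the difference between induced subgraph isomorphism (query atoms without a bond must map onto target atoms without a bond) and monomorphism (only the query's bonds have to be present in the target). The sketch below illustrates that difference with a naive try-every-mapping check. It is neither the Ullmann nor the VF2 algorithm, it ignores atom and bond types, and the function name `has_match` is made up for this example.

```python
from itertools import permutations


def has_match(query_edges, n_query, target_edges, n_target, induced):
    """Naively test whether the query graph maps into the target graph.

    induced=False: every query bond must land on a target bond (monomorphism).
    induced=True: in addition, query atom pairs without a bond must land on
    target atom pairs without a bond (induced subgraph isomorphism).
    """
    q = {frozenset(e) for e in query_edges}
    t = {frozenset(e) for e in target_edges}
    for mapping in permutations(range(n_target), n_query):
        ok = True
        for i in range(n_query):
            for j in range(i + 1, n_query):
                q_bonded = frozenset((i, j)) in q
                t_bonded = frozenset((mapping[i], mapping[j])) in t
                if q_bonded and not t_bonded:
                    ok = False
                if induced and not q_bonded and t_bonded:
                    ok = False
        if ok:
            return True
    return False


propane = [(0, 1), (1, 2)]               # CCC as a bond list
cyclopropane = [(0, 1), (1, 2), (0, 2)]  # C1CC1 as a bond list

print(has_match(propane, 3, cyclopropane, 3, induced=False))  # True
print(has_match(propane, 3, cyclopropane, 3, induced=True))   # False
```

Under the induced definition the two terminal carbons of CCC are not bonded, but every pair of atoms in the three-membered ring is, so no mapping survives. Under the monomorphism definition the extra ring bond is simply ignored, which matches what a substructure search is usually expected to do.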

Written by Rajarshi Guha

November 18th, 2008 at 11:17 pm

Multi-threaded Database Access with Python

with 8 comments

Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times, to multiple databases and significantly faster multi-threaded queries.
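The general pattern of querying several partitions at once and merging the results can be sketched as follows. This is not the code from the post: in-memory lists stand in for the separate databases, and `query_partition` and `parallel_search` are hypothetical names. In a real setup each worker would open its own database connection and run SQL against one partition.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def query_partition(rows, query, radius):
    """Stand-in for one database query: scan a single partition for hits.

    `rows` is a list of (compound_id, descriptor_vector) tuples. Real code
    would send a query to the database holding this partition instead.
    """
    hits = []
    for cid, vec in rows:
        d = math.dist(vec, query)
        if d <= radius:
            hits.append((cid, d))
    return hits


def parallel_search(partitions, query, radius):
    """Query every partition in its own thread and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [pool.submit(query_partition, rows, query, radius)
                   for rows in partitions]
        merged = [hit for f in futures for hit in f.result()]
    # Closest compounds first.
    return sorted(merged, key=lambda hit: hit[1])


if __name__ == "__main__":
    partitions = [
        [(1, [0.0, 0.0]), (2, [5.0, 5.0])],
        [(3, [0.1, 0.0])],
        [(4, [9.0, 9.0])],
    ]
    print(parallel_search(partitions, [0.0, 0.0], radius=1.0))
```

Threads pay off here only when each worker spends most of its time waiting on a database server. With the in-memory stand-in above, Python's global interpreter lock means the example shows the structure of the approach, not its speedup.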

Written by Rajarshi Guha

November 14th, 2008 at 4:46 pm

Conformational Envelopes

without comments

Joe Leonard posted a question on the CCL mailing list today regarding “conformation envelopes”. More specifically, he asked

Has there been work on creating visualizations of “conformer envelopes”, graphical representations of the conformational space occupied (or available) to molecules. Particularly when such visualizations are used to (quickly/visually) compare whether 2 molecules can adopt the same shape – or if there are shapes of one that can’t be adopted by another.

A while back I was investigating the use of the Ballester & Graham-Richards shape descriptors for 3D similarity searching. It turns out they perform quite poorly in enrichment benchmarks (which I’ll describe in a future post). At that time I was thinking about how Pub3D could scale to a multi-conformer version, and I realized that the shape descriptors would allow me to easily visualize the “shape space” of a set of compounds. When these compounds are conformers of a single molecule, one effectively gets a conformational envelope.
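For readers unfamiliar with these descriptors, the following is a rough sketch of how a 12-element shape vector of this kind can be computed from atomic coordinates. It assumes the commonly described recipe: distances from every atom to four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), each summarized by three moments. The exact moment definitions and normalizations differ between implementations, so treat the `moments` function in particular as an assumption and not as the method used in Pub3D.

```python
import math


def moments(values):
    """Mean, variance and standardized third moment of a list of distances."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    # Skewness is undefined when all distances are equal; report 0 then.
    skew = third / var ** 1.5 if var > 0 else 0.0
    return [mean, var, skew]


def shape_descriptors(coords):
    """Return 12 numbers: 3 moments for each of 4 reference points."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    d_ctd = [math.dist(c, centroid) for c in coords]
    closest = coords[d_ctd.index(min(d_ctd))]    # atom closest to centroid
    farthest = coords[d_ctd.index(max(d_ctd))]   # atom farthest from centroid
    d_fct = [math.dist(c, farthest) for c in coords]
    far_from_far = coords[d_fct.index(max(d_fct))]  # farthest from that atom
    descriptors = []
    for ref in (centroid, closest, farthest, far_from_far):
        descriptors.extend(moments([math.dist(c, ref) for c in coords]))
    return descriptors


if __name__ == "__main__":
    # Four points in a plane, used here as made-up coordinates.
    square = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
    print(shape_descriptors(square))
```

Computing one such vector per conformer gives a set of points in a 12-dimensional space. Projecting those points down to two or three dimensions is one way a "shape space" plot of the kind described above could be produced.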

Written by Rajarshi Guha

November 8th, 2008 at 10:49 pm