So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘database’ tag

Quick Comments on an Analysis of Antithrombotics

without comments

Joerg has made a nice blog post on the use of Open Source software and data to analyse the occurrence of antithrombotics. More specifically, he was trying to answer the question:

Which XRay ligands are closest to the Fontaine et al. structure-activity relationship data for allowing structure-based drug design?

He used Blue Obelisk tools and ChemSpider; here, Fontaine et al. refers to the Fontaine Factor Xa dataset. You should read his post for a nice analysis of the problem. I just wanted to consider two points he had raised.

Read the rest of this entry »

Written by Rajarshi Guha

January 5th, 2009 at 1:36 am

Brute Force – Inelegant, But Sometimes Useful

with one comment

A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows) to a partitioned version (~3M rows in each of 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.

Andrew Dalke posted a comment:

I’ve wondered about this quote from the ANN page at .

“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”

Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
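To make the brute-force alternative concrete, here is a minimal sketch of a linear-scan nearest neighbor search in 12 dimensions. It is an illustration only, not the code used in Pub3D: the function name `brute_force_knn` and the random test data are assumptions, and it relies solely on the Python standard library (`math.dist` needs Python 3.8 or later).

```python
import heapq
import math
import random


def brute_force_knn(query, points, k=5):
    """Return the k nearest points as (distance, index) pairs, closest first.

    Hypothetical helper: every distance is computed, no index is used.
    """
    dists = ((math.dist(query, p), i) for i, p in enumerate(points))
    return heapq.nsmallest(k, dists)


if __name__ == "__main__":
    random.seed(0)
    # 10,000 random 12-D vectors standing in for shape descriptors
    points = [[random.random() for _ in range(12)] for _ in range(10000)]
    for dist, idx in brute_force_knn(points[42], points, k=3):
        print(idx, round(dist, 4))
```

The cost is one distance computation per stored vector, so the run time grows linearly with the number of rows. Whether that beats an index at 17M rows is exactly the kind of thing that has to be measured rather than assumed.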

Read the rest of this entry »

Written by Rajarshi Guha

November 20th, 2008 at 5:42 pm

Multi-threaded Database Access with Python

with 8 comments

Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching, and to achieve that we’ve been employing the Ballester and Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times to multiple databases and significantly faster multi-threaded queries.
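The general pattern of querying several partitions at once and merging the hits can be sketched as follows. This is a hedged illustration, not the Pub3D code: small SQLite files stand in for the Postgres partitions so the example runs with the standard library alone, and the table name `shapes`, the single descriptor column `m1`, and all function names are hypothetical (the real descriptors are 12-dimensional).

```python
import os
import random
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 6  # the six-way split mentioned in these posts


def build_partitions(directory, rows_per_partition=1000, seed=0):
    """Create small SQLite files standing in for the database partitions."""
    rng = random.Random(seed)
    paths = []
    for part in range(N_PARTITIONS):
        path = os.path.join(directory, f"part{part}.db")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE shapes (cid INTEGER PRIMARY KEY, m1 REAL)")
        rows = [(part * rows_per_partition + i, rng.random())
                for i in range(rows_per_partition)]
        conn.executemany("INSERT INTO shapes VALUES (?, ?)", rows)
        conn.commit()
        conn.close()
        paths.append(path)
    return paths


def query_partition(path, target, radius):
    """Run a range query on one partition, using a connection of its own."""
    conn = sqlite3.connect(path)  # connections are not shared across threads
    try:
        cur = conn.execute(
            "SELECT cid, m1 FROM shapes WHERE m1 BETWEEN ? AND ?",
            (target - radius, target + radius))
        return cur.fetchall()
    finally:
        conn.close()


def parallel_query(paths, target, radius):
    """Query every partition concurrently, then merge and rank the hits."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        per_partition = list(
            pool.map(lambda p: query_partition(p, target, radius), paths))
    hits = [row for part in per_partition for row in part]
    return sorted(hits, key=lambda row: abs(row[1] - target))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        parts = build_partitions(tmp)
        print(parallel_query(parts, target=0.5, radius=0.001))
```

Threads are a reasonable fit here because each worker spends most of its time waiting on the database rather than running Python code. Each thread opens its own connection, which avoids sharing a connection object between threads.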

Read the rest of this entry »

Written by Rajarshi Guha

November 14th, 2008 at 4:46 pm

AJAX’ified Pub3D

without comments

Pub3D is a 3D version of PubChem, in which we have generated a single conformer for 99% of PubChem using the smi23d suite of programs. The structures are then stored in a PostgreSQL database along with the distance-moment shape descriptors described by Ballester and Richards. This allows us to perform shape similarity queries against a user-supplied 3D structure. By partitioning the database (thanks to the CGL folks at IU) and using a spatial index, performance is quite snappy. (I had briefly mentioned this in a presentation at the ACS meeting last spring.)
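For readers unfamiliar with distance-moment descriptors, the idea can be sketched roughly as below. This is my reading of the general approach, not the Pub3D implementation: distances from every atom to four reference points are summarized by three moments each, giving a 12-element vector. The function names are hypothetical, and the choice of mean, standard deviation and cube root of the third central moment is an assumption, since the normalization differs between implementations.

```python
import math


def _moments(dists):
    """Mean, standard deviation, and cube root of the third central moment."""
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    # signed cube root, so negative third moments are handled
    return mean, math.sqrt(var), math.copysign(abs(third) ** (1 / 3), third)


def shape_descriptors(coords):
    """12 distance-moment descriptors for a list of (x, y, z) atom positions."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    d_ctd = [math.dist(centroid, c) for c in coords]
    closest = coords[d_ctd.index(min(d_ctd))]    # atom nearest the centroid
    farthest = coords[d_ctd.index(max(d_ctd))]   # atom farthest from it
    d_fct = [math.dist(farthest, c) for c in coords]
    far_from_far = coords[d_fct.index(max(d_fct))]
    desc = []
    for ref in (centroid, closest, farthest, far_from_far):
        desc.extend(_moments([math.dist(ref, c) for c in coords]))
    return desc


def similarity(a, b):
    """1 / (1 + mean absolute difference); 1.0 means identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))
```

Because only interatomic distances enter the calculation, the descriptors do not change when a structure is translated or rotated, so no alignment step is needed before comparing two molecules. That is also why a query reduces to a nearest neighbor search in a 12-dimensional space.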

The database had been down for some time, so today I got it back up and running and AJAX’ified the interface to make it look a little nicer. jQuery rocks! (OK, the color scheme sucks.)

There are obvious drawbacks to the current database – a single-conformer shape search is not very rigorous, especially since the stored structures are not necessarily the minimum energy conformers. However, we have started generating multiple conformers, so hopefully we’ll address this issue in time. The bigger issue is how this approach to shape similarity compares to other well-known approaches such as ROCS. Clearly, a shape descriptor approach is lower resolution than a volumetric approach such as ROCS, so in that sense the results are ‘rougher’. However, visual inspection of some searches seems to indicate that it isn’t too bad. The paper describing these shape descriptors didn’t do a rigorous comparison – that’s on our TODO list.

OK, the fun part (a.k.a. coding) is done for now – got to get back to the paper.

Written by Rajarshi Guha

October 3rd, 2008 at 5:33 am