Archive for the ‘database’ tag
I’ve been working for some time with the PubChem Bioassay collection – a set of 1293 assays that cover a range of techniques (enzymatic, phenotypic etc.), targets and sizes (from 20 molecules to 200,000 molecules). In addition, some assays are primary, high-throughput assays whereas a number of them are smaller, confirmatory assays. While an extremely valuable collection, one of the drawbacks is the lack of curation. This has led to some people saying that the data is too noisy to be useful. Yes, the noise is a problem, but I think there’s still useful data to extract and model.
One of the problems that I have faced is that while one can perform a full text search for assays on PubChem, there is no form of annotations on the assays themselves. One effect of this is that it is difficult to link an assay to other biological resources (though for enzymatic assays, one can determine a Pubmed protein identifier). While working on my bioassay network project, I needed annotations and I didn’t want to do it manually.
Which XRay ligands are closest to the Fontaine et al. structure-activity relationship data for allowing structure-based drug design?
Using Blue Obelisk tools and ChemSpider and where Fontaine et al. refers to the Fontaine Factor Xa dataset. You should read his post for a nice analysis of the problem. I just wanted to consider two points he had raised.
A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned version (~ 3M rows in 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.
I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/ .
“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”
Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times, to multiple databases and significantly faster multi-threaded queries.
Pub3D is a 3D version of PubChem, in which we have generated a single conformer for 99% of PubChem using the smi23d suite of programs. The structures are then stored in a PostgreSQL database along with their distance moment shape descriptors described by Ballester and Graham-Richards. This allows us to perform shape similarity queries against a user supplied 3D structure. By partitioning the database (thanks to the CGL folks at IU) and using a spatial index, performance is quite snappy. (I had briefly mentioned this in a presentation at the ACS meeting, last spring).
The database had been down for some time, so today I got it back up and running and AJAX’ified the interface, to make it look a little nicer. jQuery rocks! (OK, the color scheme sucks)
There are obvious drawbacks to the current database – single conformer shape search is not very rigorous, especially since the stored structures are not necessarily the minimum energy conformer. However, we have started generating multiple conformers, so hopefully we’ll address this issue in time. The bigger issue is how this approach to shape similarity compares to other well known approaches such as ROCS. Clearly, a shape descriptor approach is lower resolution to a volumetric approach such as ROCS, so in that sense the results are ‘rougher’. However visual inspection of some searches seems to indicate that it isn’t too bad. The paper describing these shape descriptors didn’t do a rigorous comparison – that’s on our TODO list.
OK, the fun part (a.k.a, coding) is done for now – got to get back to the paper.