So much to do, so little time

Trying to squeeze sense out of chemical data

Substructure Searches – High Speed, Large Scale

with 5 comments

My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).

It is blazingly fast.

I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000 structure subset is available here. For each query structure I record the number of hits, time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.

The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.

The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops of with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) and network transfer times.

Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, … atoms) as there were just a single query structure with that number of atoms.

Overall, very impressive. And it’s something you can play with yourself.

Written by Rajarshi Guha

November 23rd, 2011 at 1:09 am

5 Responses to 'Substructure Searches – High Speed, Large Scale'

Subscribe to comments with RSS or TrackBack to 'Substructure Searches – High Speed, Large Scale'.

  1. Thanks for pointing this out. I’ll need to read more about what they’ve been doing.

    I was curious if they cached the results. C1CCC[U]CC1 takes 3.053s the first time then 1.657s the next two times, so something is going on but it’s not simply caching previous hits. Perhaps it’s just swapping in from disk.

    “[U][P][Na]” and “[*h8]” cause the system to hang. (That latter kills a PubChem search too :)

    Andrew Dalke

    23 Nov 11 at 12:11 pm

  2. Awesome!

    Esp. when the compounds continue to appear and appear…

  3. […] I said before, seriously fast substructure searches. It also helps that I can build these examples via a public […]

  4. “…I won’t go into the detail of the underlying pre-screen and indexing methodology…”

    Will there be any publication about this? I specifically wonder if it is a read-only optimized design or if the index can handle write access equally well?

    Ernst-Georg Schmid

    19 Dec 11 at 8:54 am

  5. Hi Ernst, yes, we’re working on the paper right now, doing some more extensive benchmarking. At this point, it’s specifically read-only

    Rajarshi Guha

    28 Dec 11 at 1:06 am

Leave a Reply