Substructure Searches – High Speed, Large Scale

My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).

It is blazingly fast.

I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000 structure subset is available here. For each query structure I record the number of hits, time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.

The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.

The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops of with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) and network transfer times.

Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, … atoms) as there were just a single query structure with that number of atoms.

Overall, very impressive. And it’s something you can play with yourself.

5 thoughts on “Substructure Searches – High Speed, Large Scale

  1. Andrew Dalke says:

    Thanks for pointing this out. I’ll need to read more about what they’ve been doing.

    I was curious if they cached the results. C1CCC[U]CC1 takes 3.053s the first time then 1.657s the next two times, so something is going on but it’s not simply caching previous hits. Perhaps it’s just swapping in from disk.

    “[U][P][Na]” and “[*h8]” cause the system to hang. (That latter kills a PubChem search too :)

  2. Awesome!

    Esp. when the compounds continue to appear and appear…

  3. […] I said before, seriously fast substructure searches. It also helps that I can build these examples via a public […]

  4. Ernst-Georg Schmid says:

    “…I won’t go into the detail of the underlying pre-screen and indexing methodology…”

    Will there be any publication about this? I specifically wonder if it is a read-only optimized design or if the index can handle write access equally well?

  5. Hi Ernst, yes, we’re working on the paper right now, doing some more extensive benchmarking. At this point, it’s specifically read-only

Leave a Reply

Your email address will not be published. Required fields are marked *