My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).
It is blazingly fast.
I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000 structure subset is available here. For each query structure I record the number of hits, time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.
The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.
The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops of with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) and network transfer times.
Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, … atoms) as there were just a single query structure with that number of atoms.
Overall, very impressive. And it’s something you can play with yourself.
I came across an ASAP paper today describing substructure searching in Oracle databases. The paper comes from the folks at J & J and is part of their series of papers on the ABCD platform. Performing substructure searches in databases is certainly not a new topic and various products are out there that support this in Oracle (as well as other RDBMSs). The paper describes how the ABCD system does this using a combination of structure-derived hash keys and an inverted bitset based index and discuss their implementation as an Oracle cartridge. They provide an interesting discussion of how their implementation supports Cost Based Optimization of SQL queries involving substructure search. The authors run a number of benchmarks. In terms of comparative benchamrks they compare the performance (i.e., screening efficiency) of their hashed keys versus MACCS keys, CACTVS keys and OpenBabel FP2 fingerprints. Their results indicate that the screening step is a key bottleneck in the query process and that their hash key is generally more selective than the others.
Unfortunately, what would have been interesting but was not provided was a comparison of the performance at the Oracle query level with other products such as JChem Cartridge and OrChem. Furthermore, the test case is just under a million molecules from Golovin & Henrick – the entire dataset (not just the keys) could probably reside in-memory on todays servers. How does the system perform when say faced with PubChem (34 million molecules)? The paper mentions a command line implementation of their search procedure, but as far as I can tell, the Oracle cartridge is not available.
The ABCD system has many useful and interesting features. But as with the other publications on this system, this paper is one more in the line of “Papers About Systems You Can’t Use or Buy“. Unfortunate.
The time has come to move again – though, in this case, it’s just a geographic move. From August I’ll be living in Manchester, CT (great cheeseburgers and lovely cycle routes) and will continue to work remotely for NCGC. I’ll be travelling to DC every month or so. The rest of the time I’ll be working from Connecticut.
Being new to the area, it’d be great to meet up over a beer, with people in the surrounding areas (NY/CT/RI) doing cheminformatics, predictive modeling and other life science related topics (any R user groups in the area?). If anybody’s interested, drop me a line (comment, mail or @rguha).
With the 2011 Fall ACS meeting coming up in Denver next month, CINF will be hosting another round of lightning talks – 8 minutes to talk about anything related to cheminformatics and chemical information. As before, these talks won’t be managed via PACS, as a result of which we are taking short abstracts between July 14 and Aug 14.We hope that we’ll get to hear about interesting and recent stuff. Remember, this is meant to be a fun event so be creative! (You can see slides from the first run of this session last year).
The full announcement is below:
For the 2011 Fall meeting in Denver (Aug 28 – Sep 1), CINF will be running an experimental session of lightning talks – short, strictly timed talks. The session does not have a specific topic, however, all talks should be related to cheminformatics and chemical information. One of the key features of this session is that we will not be using the traditional ACS abstract submission system, since that system precludes the inclusion of recent work in the program.
So, since we will be accepting abstracts directly, the expectation is that they be about recent work and developments, rather than rehashes of year-old work. In addition, talks should not be verbal versions of posters submitted for this meeting. Given the short time limits we don’t expect great detail – but we are expecting compact and informative presentations.
That’s the challenge.
- Talks should be no longer than 8 minutes in length. At 8 minutes, you will be asked to stop.
- Use as many slides as you want, as long as you can finish in 8 minutes
- Talks should not be rehashes of poster presentations
- Talks will run back to back, and questions & discussion will be held of off until the end
If you haven’t participated in these types of talks before here are some suggestions:
- No more than three slides for a 5 minute talk (but if you can pull of 20 slides in 8 minutes, more power to you)
- Avoid slides with too much text (and don’t paste PDF’s of papers!)
- A single chart per slide and make sure labels are readable at a distance
1:30pm, Wednesday, August 31st, 2011
Submissions run from July 14 to Aug 14
Room 112, Colorado Convention Center
- Send in an abstract of about 100 – 120 words to firstname.lastname@example.org
- We will let you know if you will be speaking by Aug 21 and we will need slide decks by Aug 24
- You must be registered for the meeting
- Note that the usual publication/copyright rules apply
- We will encourage live blogging and tweets (if we have net access)
With the recent stable release of the CDK (1.3.12) and the inclusion of the new rendering classes, I was able to make a new release of the rcdk (3.1.1) and rcdklibs (1.3.11) packages that support cheminformatics in R. They’ve been pushed to CRAN and should be visible in a day or two. The new features in the latest version of rcdk include
- Directly evaluate molecular volume (based on group contributions) using get.volume
- Generate fingerprints using the hybridization state
- get.total.charge and get.total.formal.charge work sensibly
- Added a function (copy.image.to.clipboard) that copies the 2D depiction of a molecule to the system clipboard in PNG format
- Now, OS X users can view and copy molecule depictions. This is slower compared to the same operation on Windows or Linux since it involves shell’ing out via system. But it is better than not being able to view anything.