Joerg has written a nice blog post on using Open Source software and data to analyse the occurrence of antithrombotics. More specifically, he was trying to answer the question
Which XRay ligands are closest to the Fontaine et al. structure-activity relationship data for allowing structure-based drug design?
using Blue Obelisk tools and ChemSpider; here "Fontaine et al." refers to the Fontaine Factor Xa dataset. You should read his post for a nice analysis of the problem. I just want to consider two points he raised.
First, he tried to use ChemSpider to go from InChiKey to a PubChem Compound ID (CID). He commented
Though, this web-based retrieval is still extremely slow and suboptimal, and certainly not usable for thousands of molecules.
I have recently run into timeouts when using ChemSpider programmatically. Furthermore, looking at Joerg's code suggests that he has to parse the CID out of the resultant page, which adds quite a bit of complexity.
At IU, I maintain a local mirror of PubChem in a PostgreSQL database. My first response was to get the InChiKeys that Joerg posted and do a quick SQL query of the form
    SELECT cid FROM pubchem_compound WHERE inchikey IN ('DOXFBRJKDIQAPF-IFDINXDUAL','STPQKWOPPYLTBM-KHRKUUQNAT', ... );
This took 0.63 sec and returned zero hits (which matches what Joerg observed). While SQL access to the mirror is publicly available, a REST interface makes things simpler. Thus visiting the URL
http://rguha.ath.cx/~rguha/cicc/rest/db/pubchem/cidikey/RYYVLZVUVIJVGH-UHFFFAOYAW
returns a plain text page with the CID of caffeine. This is simpler than constructing an SQL query and there’s nothing to parse. The following script tries to find the CIDs for the InChiKeys that Joerg mentioned:
    # Python 2
    import urllib

    baseurl = 'http://rguha.ath.cx/~rguha/cicc/rest/db/pubchem/cidikey/'
    srcfile = 'http://www.joergkurtwegner.de/blog/20090104/fXa.inchi.cid.tab.txt'

    # the InChIKey is the first whitespace-separated field on each line
    inchikeys = [x.split()[0] for x in urllib.urlopen(srcfile).readlines()]
    print 'Got %d InChI keys' % (len(inchikeys))

    # one REST request per key; a hit comes back as a single line holding the CID
    for inchikey in inchikeys:
        cid = urllib.urlopen(baseurl+inchikey).readlines()
        if len(cid) == 1:
            print inchikey, cid[0]
It is certainly not as sophisticated or robust as Joerg's version, but it does the job. A quick timing showed that it took a total of 208 sec for 579 InChiKeys, or about 0.36 sec per key. As Joerg pointed out, this is not fast enough to handle thousands of compounds. Given that the direct SQL query is roughly 300 times faster than the REST queries (0.63 sec versus 208 sec, both done across the network), this suggests that having a local mirror or data dump would be helpful for such studies.
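To make the local-mirror idea concrete, here is a minimal Python 3 sketch of a batched lookup. It is only an illustration under assumptions: an in-memory SQLite table stands in for the PostgreSQL mirror, the `pubchem_compound` table is reduced to the two columns used in the query above, and `lookup_cids` and `chunk_size` are hypothetical names, not part of any existing interface.

```python
import sqlite3


def lookup_cids(conn, inchikeys, chunk_size=500):
    """Return {inchikey: cid} for every key found in pubchem_compound.

    Keys are sent in chunks so that a single statement never carries
    more bound parameters than the database allows.
    """
    found = {}
    for start in range(0, len(inchikeys), chunk_size):
        chunk = inchikeys[start:start + chunk_size]
        # one "?" placeholder per key; values are bound, never interpolated
        placeholders = ",".join("?" for _ in chunk)
        sql = ("SELECT inchikey, cid FROM pubchem_compound "
               "WHERE inchikey IN (%s)" % placeholders)
        for key, cid in conn.execute(sql, chunk):
            found[key] = cid
    return found


# Toy stand-in for the mirror (assumed schema: just cid and inchikey).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubchem_compound (cid INTEGER PRIMARY KEY, inchikey TEXT)")
conn.execute("CREATE INDEX idx_inchikey ON pubchem_compound (inchikey)")
# caffeine, using the key from the REST example above
conn.execute("INSERT INTO pubchem_compound VALUES (?, ?)",
             (2519, "RYYVLZVUVIJVGH-UHFFFAOYAW"))

queries = ["RYYVLZVUVIJVGH-UHFFFAOYAW", "DOXFBRJKDIQAPF-IFDINXDUAL"]
hits = lookup_cids(conn, queries)
print(hits)
```

Keys that are absent from the table simply do not appear in the returned dictionary, which mirrors the "zero hits" outcome of the SQL query above. An index on the `inchikey` column is what keeps this kind of lookup fast as the table grows.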
The second point Joerg raises concerns similarity queries. He notes
Maybe there exist already a web-service for returning for a single molecule query the most similar or the most five similar compounds, but I have not found out how
One problem I foresee with this, at least in the PubChem collection, is that such a similarity query would return many extremely similar compounds. While one could filter out salt forms, this would certainly slow down such a query. To get more variety one would have to use lower similarity cutoffs, which makes the search slower. However, using the heuristics described by Swamidass and Baldi, such a service is easy to implement (and efficient as long as the similarity cutoff is relatively high).
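As a rough illustration of why a high cutoff helps: for two bit fingerprints with a and b bits set, the Tanimoto similarity cannot exceed min(a, b)/max(a, b), which is the kind of bit-count bound Swamidass and Baldi describe. Fingerprints whose bit counts fall outside that window can be skipped without ever being compared. The Python 3 sketch below assumes fingerprints stored as plain integers; the toy 8-bit fingerprints and the names `build_index` and `search` are made up for the example and are not taken from any particular toolkit.

```python
from collections import defaultdict


def popcount(fp):
    """Number of set bits in an integer fingerprint."""
    return bin(fp).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two integer bit fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def build_index(fingerprints):
    """Group (name, fingerprint) pairs by their number of set bits."""
    index = defaultdict(list)
    for name, fp in fingerprints.items():
        index[popcount(fp)].append((name, fp))
    return index


def search(index, query, threshold):
    """Return (hits, n_compared) for a Tanimoto threshold search.

    A whole bit-count bucket is skipped when the upper bound
    min(a, b) / max(a, b) is already below the threshold.
    """
    a = popcount(query)
    hits, n_compared = [], 0
    for b, bucket in index.items():
        bound = min(a, b) / max(a, b) if max(a, b) else 0.0
        if bound < threshold:
            continue  # no fingerprint in this bucket can reach the cutoff
        for name, fp in bucket:
            n_compared += 1
            t = tanimoto(query, fp)
            if t >= threshold:
                hits.append((name, t))
    hits.sort(key=lambda h: -h[1])  # most similar first
    return hits, n_compared


# Toy 8-bit fingerprints (hypothetical data).
fps = {
    "A": 0b11110000,
    "B": 0b11100000,
    "C": 0b00001111,
    "D": 0b11111111,
    "E": 0b10000000,
}
index = build_index(fps)
hits, n_compared = search(index, 0b11110000, threshold=0.7)
print(hits, n_compared)
```

With a cutoff of 0.7 only the buckets with three or four set bits survive the bound, so three of the five fingerprints are actually compared. Lowering the cutoff widens the window of admissible bit counts, which is why low-similarity searches prune less and run slower.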