I met with Jean-Claude Bradley yesterday and we had a pretty useful hack session, allowing him to easily incorporate chemical and cheminformatics functionality into a GoogleDocs spreadsheet.
A common task that Jean-Claude wanted to automate was the calculation of milligrams (or milliliters) of a chemical required for a certain molarity. So what we need for this calculation is the compound name, desired molarity, molecular weight and the density. Importantly, the people who’d like to use this will provide compound names and not a directly parseable SMILES. So we’d also like to (optionally) get the SMILES. Finally, he wanted to be able to do this in a Google spreadsheet – rather than a specific web page or stand alone program.
It turns out that with a liberal helping of Python, a dash of ChemSpider and pinch of PubChem, all of this can be done in a half hour hack session.
We start of with a Google spreadsheet containing a column of chemical names such as “benaldehyde”, “aspirin” and so on. The first thing we need is a value of density. Since ChemSpider reports experimental and predicted values of density the quickest way was to extract it from there. Unfortunately I couldn’t work out how to use the Web API, so I fell back on the venerable tradition of screen scraping. Thus, one can get the page for benzaldehyde via
http://www.chemspider.com/Search.aspx?q=benzaldehyde
and then using a regular expression, extract the value of density. In some cases, this won’t be available or multiple values may be listed. Right now we just look at the first available value. To make life simple, I wrapped this is in a simple mod_python program resulting in a simple REST interface to density values:
http://rguha.ath.cx/~rguha/cicc/rest/db/chemspider/benzaldehyde
This returns a page containing a single number for the density and if it couldn’t find it, returns -1. We also get a nice freebie – replacing with benzaldehyde with c1ccccc1C=O also returns the density value. One could also provide the InChI key and get the same result.
OK, so we have an easy way to get the density of a compound given its name in an automated fashion. How to get this into the spreadsheet? If the compound name is in cell A2, simply set the value of cell B2 (for example) to
=ImportData("http://rguha.ath.cx/~rguha/cicc/rest/db/chemspider/"&A2)
and on hitting enter the cell value should be 1.045. As you’d expect, dragging B2 down should replicate the function appropriately over the rows.
The next thing we want is to get the molecular weight. While we can also extract this from the ChemSpider web pages, it was a little more tricky. I first tried BeautifulSoup but it seems that ChemSpider has some weird Unicode stuff going on and this approach didn’t work (at least in the 10 minutes I looked at it!). So I fell back to our trusty mirror PubChem. This database has a collection of synonyms associated with each compound. Given a chemical name, we can look up the synyonym and identify a compound id (CID) and simply pull out the reported molecular weight in a single SQL query. Once again, this gets wrapped in a simple REST interface
http://rguha.ath.cx/~rguha/cicc/rest/db/pubchem/synmw/benzaldehyde
giving back a single number if the molecule was found in the database. As with the density, we can set the value of a cell in the spreadsheet using ImportData. Of course it’s useful to get the SMILES for a chemical name. While there are tools that one could use such as LexiChem, one could use ChemSpider. However I already had a REST service that uses our local PubChem mirror allowing one to write
http://rguha.ath.cx/~rguha/cicc/rest/db/pubchem/synonym/benzaldehyde
giving back the SMILES. Once again, this is easily incorporated into a Google spreadsheet. So at this point we have columns of chemical names (manually entered), desired molarity (manually entered), molecular weight (via REST) and density (via REST). The final calculation is trivial and voila, one has the required mililiters of the chemical required.
Caveats
This was a quick hack so there are a number of issues. It’s possible that the ChemSpider query will return multiple hits. Right now, such cases are treated as getting no hits – since we don’t know which one we want. We’re using the PubChem mirror for name to structure / weight lookup. There are two issues – first, it’s usually a month out of sync with the real PubChem. And second, multiple hits for a given name are possible and if the input name is slightly misspelled, it may not be found in the DB even though it exists. Another problem is that while we can get the density via common name, SMILES or InChi, the PubChem mirror lookup only works with common name. But extending this to support other forms of input is trivial.
Conclusions
Ideally, it’d be all done through one source (say ChemSpider), but the current approach does highlight the distributed nature of the solution. In the end the individual pieces are quite simple in terms of code and content. But by putting it all together, we end up with a very useful application that does real stuff!
But the bigger message that comes out of this is that while Google spreadsheets is a general purpose tool, it’s online nature, coupled with the ability to import data from the web makes it customisable for different domains. In this case, it turns out we can do cheminformatics in a Google spreadsheet.
Nice post , Its great to see how you use mod python to provide a URL interface to a python routine . I an now see how you probably may have implemented the CID to CAS lookup by screenscraping the result using a regexp for CAS ID and returning it via the mod_python interface
Very cool stuff. Been learning python and its good to see how mod_python belongs in this ecosystem
Thanks! Actually, the CID to CAS didn’t involve screen scraping – I use our local PubChem mirror to do that, looking for synonyms that match the CAS regex. I should do a quick post on it
Recently I found the online database http://dabbledb.com maybe you find it useful
Vladimir, thanks for the pointer. I’ve seen a number of people talk about this, looks very powerful