So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘pubchem’ tag

Getting a CAS Number from a PubChem CID

A few days back, Hari on FriendFeed had asked how one could get a CAS number from a PubChem compound ID (CID). The reverse, that is, finding a CID for a given CAS number, is generally quite easy, as shown by Rich here and here. Since I was trying to get some writing done, this was a good excuse for a quick hack to solve the problem.
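
The post goes on to describe the hack itself. As a rough illustration of the general idea, here is a minimal Python sketch that pulls the synonym list for a CID from PubChem's PUG REST service and keeps anything that looks like a CAS Registry Number. The URL, function name and regular expression are my own assumptions for illustration, not the code from the original post (which predates PUG REST).

import re
import urllib.request

def cas_from_cid(cid):
    """Return synonyms of a PubChem CID that look like CAS numbers (a heuristic sketch)."""
    # PUG REST synonym lookup; an assumption, not the service used in the original hack
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/synonyms/TXT"
    with urllib.request.urlopen(url) as response:
        synonyms = response.read().decode("utf-8").splitlines()
    # CAS numbers have the form: 2-7 digits, 2 digits, 1 check digit
    cas_pattern = re.compile(r"^\d{2,7}-\d{2}-\d$")
    return [s for s in synonyms if cas_pattern.match(s)]

print(cas_from_cid(2244))  # aspirin (CID 2244); 50-78-2 should be among the matches

A more careful implementation would also validate the CAS check digit, since other registry-style identifiers can match the same pattern.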

Written by Rajarshi Guha

December 12th, 2008 at 2:49 am

Posted in software

Chemistry in Google Docs

I met with Jean-Claude Bradley yesterday and we had a pretty useful hack session, which allowed him to easily incorporate chemical and cheminformatics functionality into a Google Docs spreadsheet.

A common task that Jean-Claude wanted to automate was calculating the milligrams (or milliliters) of a chemical required for a solution of a given molarity. For this calculation we need the compound name, the desired molarity, the molecular weight and the density. Importantly, the people who’d like to use this will provide compound names rather than directly parseable SMILES, so we’d also like to (optionally) retrieve the SMILES. Finally, he wanted to be able to do all of this in a Google spreadsheet rather than on a specific web page or in a standalone program.

It turns out that with a liberal helping of Python, a dash of ChemSpider and a pinch of PubChem, all of this can be done in a half-hour hack session.
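
As a rough sketch of the underlying arithmetic (and deliberately not the Google Docs integration from the hack session), the mass needed is just molarity × volume × molecular weight. The snippet below looks the molecular weight up by name via PubChem's PUG REST service rather than ChemSpider; the function names and URL are assumptions for illustration.

import json
import urllib.parse
import urllib.request

def molecular_weight(name):
    """Molecular weight (g/mol) for a compound name, via PubChem PUG REST (an assumption)."""
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           + urllib.parse.quote(name) + "/property/MolecularWeight/JSON")
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    return float(data["PropertyTable"]["Properties"][0]["MolecularWeight"])

def mg_required(name, molarity, volume_ml):
    """Milligrams of compound needed for volume_ml of solution at the given molarity (mol/L)."""
    moles = molarity * (volume_ml / 1000.0)          # mol
    return moles * molecular_weight(name) * 1000.0   # g -> mg

print(round(mg_required("caffeine", 0.1, 10.0), 1))  # 10 mL of a 0.1 M caffeine solution

For a liquid reagent, dividing the mass by the density (g/mL) gives the volume to dispense, which is where the density input in the spreadsheet comes in.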

Written by Rajarshi Guha

December 10th, 2008 at 4:23 pm

Do the CDK Fingerprints Work?

In a previous post, I discussed virtual screening benchmarks and some new public datasets for this purpose. I recently improved the performance of the CDK hashed fingerprints, and the next question that arose was whether the CDK fingerprints are any good. With these new datasets in hand, I decided to quantitatively measure how the CDK fingerprints compare to some other well-known fingerprints.

Update – there was a small bug in the calculations used to generate the enrichment curves in this post. The bug is now fixed and the conclusions do not change in any significant way. For the latest (and additional) results, take a look here.
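
For readers unfamiliar with the metric, an enrichment curve simply tracks what fraction of the known actives has been recovered after screening the top fraction of the ranked database. A generic Python sketch of the bookkeeping (not the code used to generate the figures in the post) might look like this, assuming a list of scores and a parallel list of active/decoy labels:

def enrichment_curve(scores, is_active):
    """Fraction of actives recovered vs. fraction of database screened (a generic sketch)."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_total = len(ranked)
    n_actives = sum(1 for _, active in ranked if active)
    curve, found = [], 0
    for i, (_, active) in enumerate(ranked, start=1):
        if active:
            found += 1
        curve.append((i / n_total, found / n_actives))
    return curve

# Toy example: three actives scattered through a ranked list of six compounds
for screened, recovered in enrichment_curve([0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
                                            [True, False, True, False, False, True]):
    print("%.2f %.2f" % (screened, recovered))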

Written by Rajarshi Guha

October 11th, 2008 at 5:47 am

Datasets for Virtual Screening Benchmarks

Virtual screening (VS) is a common task in the drug discovery process: a computational method to identify promising compounds from collections of hundreds to millions of possible compounds. What “promising” means exactly depends on the context – it might be compounds that are likely to exhibit certain pharmacological effects, or compounds that are expected to be non-toxic, or combinations of these and other properties. Many methods are available for virtual screening, including similarity, docking and predictive models.

So, given the plethora of methods, which one do we use? Many factors affect the choice of a VS method, including availability, price, computational cost and so on. But in the end, deciding whether one method is better than another depends on benchmarks. There are two aspects to a VS benchmark: the metric used to decide whether one method is better than another, and the data used for benchmarking. This post focuses on the latter.

Written by Rajarshi Guha

October 9th, 2008 at 1:49 pm

AJAX’ified Pub3D

Pub3D is a 3D version of PubChem, in which we have generated a single conformer for 99% of PubChem using the smi23d suite of programs. The structures are stored in a PostgreSQL database along with the distance moment shape descriptors described by Ballester and Richards. This allows us to perform shape similarity queries against a user-supplied 3D structure. By partitioning the database (thanks to the CGL folks at IU) and using a spatial index, performance is quite snappy. (I had briefly mentioned this in a presentation at the ACS meeting last spring.)
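
The descriptors themselves are cheap to compute: take the distances of every atom to four reference points (the molecular centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that last atom) and summarize each of the four distance distributions by its first three moments, giving a 12-element vector; two shapes are then compared via the normalized Manhattan distance between their vectors. Below is a minimal numpy sketch of that scheme, assuming an N×3 array of coordinates – an illustration of the published approach, not the code running behind Pub3D.

import numpy as np

def shape_descriptor(coords):
    """12-element distance moment shape descriptor for an (N, 3) coordinate array (a sketch)."""
    centroid = coords.mean(axis=0)
    d_ctd = np.linalg.norm(coords - centroid, axis=1)
    closest = coords[d_ctd.argmin()]       # atom closest to the centroid
    farthest = coords[d_ctd.argmax()]      # atom farthest from the centroid
    d_cst = np.linalg.norm(coords - closest, axis=1)
    d_fct = np.linalg.norm(coords - farthest, axis=1)
    extreme = coords[d_fct.argmax()]       # atom farthest from the farthest atom
    d_ftf = np.linalg.norm(coords - extreme, axis=1)

    def moments(d):
        mean = d.mean()
        sigma = d.std()
        skew = np.cbrt(((d - mean) ** 3).mean())  # cube root keeps distance units
        return [mean, sigma, skew]

    return np.array(moments(d_ctd) + moments(d_cst) + moments(d_fct) + moments(d_ftf))

def shape_similarity(a, b):
    """Similarity in (0, 1]: inverse of one plus the mean absolute descriptor difference."""
    return 1.0 / (1.0 + np.abs(a - b).mean())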

The database had been down for some time, so today I got it back up and running and AJAX’ified the interface to make it look a little nicer. jQuery rocks! (OK, the color scheme sucks.)

There are obvious drawbacks to the current database – single-conformer shape search is not very rigorous, especially since the stored structures are not necessarily the minimum energy conformers. However, we have started generating multiple conformers, so hopefully we’ll address this issue in time. The bigger issue is how this approach to shape similarity compares to other well-known approaches such as ROCS. Clearly, a shape descriptor approach is lower resolution than a volumetric approach such as ROCS, so in that sense the results are ‘rougher’. However, visual inspection of some searches suggests that it isn’t too bad. The paper describing these shape descriptors didn’t do a rigorous comparison – that’s on our TODO list.

OK, the fun part (a.k.a, coding) is done for now – got to get back to the paper.

Written by Rajarshi Guha

October 3rd, 2008 at 5:33 am