A few days back, Hari on FriendFeed asked how one could get a CAS number from a PubChem compound ID (CID). The reverse, that is, finding a CID for a given CAS number, is generally quite easy, as shown by Rich here and here. Since I was trying to get some writing done, this was a good excuse for a quick hack to solve the problem.
At IU I maintain a partial mirror of PubChem, which stores the compound, substance, association and synonym data (and a partial dump of the bioassay data) in a PostgreSQL database, updated on a monthly schedule. For this problem, the synonym table contains two columns – CID and synonyms associated with a CID. For example, the synonyms for CID 2244 include aspirin, Acenterine, Salcetogen, 2-Acetoxybenzoic acid, Acetilum acidulatum and so on. In fact, 272 synonyms are listed for this CID.
Now, the PubChem synonym data contains CAS numbers for many of the CIDs. The problem is that they are not marked as CAS numbers. In other words, they’re just plain text. However, CAS numbers do follow a specific format (two to seven digits, a hyphen, two digits, a hyphen and a single check digit), best described using a regular expression such as ^[0-9]{2,7}-[0-9]{2}-[0-9]$.
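As a rough illustration of that format, here is a minimal Python sketch. The pattern is my reading of the standard CAS layout rather than the exact expression used on the server, and the names CAS_PATTERN and looks_like_cas are made up for this example.

```python
import re

# Layout only: 2-7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
# This does not verify the check digit itself.
CAS_PATTERN = re.compile(r"[0-9]{2,7}-[0-9]{2}-[0-9]")


def looks_like_cas(text):
    """Return True if the whole string has the CAS number layout."""
    return CAS_PATTERN.fullmatch(text) is not None


# A few of the synonyms listed for CID 2244, plus its CAS number.
synonyms = ["aspirin", "50-78-2", "2-Acetoxybenzoic acid", "Acenterine"]
candidates = [s for s in synonyms if looks_like_cas(s)]
print(candidates)
```

Using fullmatch rather than search matters here: a name like 2-Acetoxybenzoic acid contains hyphens and digits but should not be picked up.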
Thus resolving the CAS number for a given CID reduces to a SQL query that retrieves the synonyms for the CID and keeps the entries matching the CAS pattern described above. PostgreSQL does support POSIX regexes through the ~ operator, but the query I used relies on the slightly more verbose SIMILAR TO form. Note that _ matches any single character, not just a digit, so these patterns only check the layout:
SELECT synonym FROM pubchem_synonym WHERE
  cid = '2244' AND
  (synonym SIMILAR TO '__-__-_' OR
   synonym SIMILAR TO '___-__-_' OR
   synonym SIMILAR TO '____-__-_' OR
   synonym SIMILAR TO '_____-__-_' OR
   synonym SIMILAR TO '______-__-_' OR
   synonym SIMILAR TO '_______-__-_');
Since the CID field has a hash index, getting the set of synonyms is very fast and so doing the regex match on a few tens or hundreds of entries is not too much of a problem.
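The two-step idea (an indexed lookup by CID, then a pattern match over the handful of rows that come back) can be sketched as follows. This is only an approximation under stated assumptions: it uses an in-memory SQLite database as a stand-in for the PostgreSQL mirror, SQLite builds a B-tree index rather than a hash index, the sample rows are illustrative, and cas_candidates is a hypothetical helper name.

```python
import re
import sqlite3

CAS_PATTERN = re.compile(r"[0-9]{2,7}-[0-9]{2}-[0-9]")

# In-memory stand-in for the real table; only a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubchem_synonym (cid INTEGER, synonym TEXT)")
conn.execute("CREATE INDEX idx_synonym_cid ON pubchem_synonym (cid)")
conn.executemany(
    "INSERT INTO pubchem_synonym VALUES (?, ?)",
    [
        (2244, "aspirin"),
        (2244, "50-78-2"),
        (2244, "2-Acetoxybenzoic acid"),
        (962, "water"),
        (962, "7732-18-5"),
    ],
)


def cas_candidates(connection, cid):
    """Fetch the synonyms for one CID via the index, then filter by layout."""
    rows = connection.execute(
        "SELECT synonym FROM pubchem_synonym WHERE cid = ?", (cid,)
    )
    return [syn for (syn,) in rows if CAS_PATTERN.fullmatch(syn)]


print(cas_candidates(conn, 2244))
```

The pattern match runs only over the rows already selected by CID, so its cost stays small however large the synonym table grows.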
Given the above SQL query, we can wrap it up in a mod_python script to provide a simple REST interface to this functionality.
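To give a feel for what such a wrapper might look like, here is a small sketch. It is not the actual mod_python script: it uses a plain WSGI callable from the standard library world instead, the /cas/<cid> URL layout is invented for illustration, and the SYNONYMS dictionary is a hypothetical stand-in for the database query.

```python
import re

# Hypothetical in-memory stand-in for the pubchem_synonym table.
SYNONYMS = {2244: ["aspirin", "50-78-2", "2-Acetoxybenzoic acid"]}
CAS_PATTERN = re.compile(r"[0-9]{2,7}-[0-9]{2}-[0-9]")
PLAIN = [("Content-Type", "text/plain; charset=utf-8")]


def lookup_cas(cid):
    """Return the synonyms of a CID that have the CAS layout."""
    return [s for s in SYNONYMS.get(cid, []) if CAS_PATTERN.fullmatch(s)]


def app(environ, start_response):
    """WSGI callable answering GET /cas/<cid> with one CAS number per line."""
    match = re.fullmatch(r"/cas/([0-9]+)/?", environ.get("PATH_INFO", ""))
    if match is None:
        start_response("400 Bad Request", PLAIN)
        return [b"expected /cas/<cid>\n"]
    hits = lookup_cas(int(match.group(1)))
    if not hits:
        start_response("404 Not Found", PLAIN)
        return [b"no CAS-like synonym found\n"]
    start_response("200 OK", PLAIN)
    return [("\n".join(hits) + "\n").encode("utf-8")]


def call(path):
    """Invoke the app directly, without a server; return (status, body)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return seen["status"], body.decode("utf-8")


print(call("/cas/2244"))
```

For a quick local trial, the same app object could be served with wsgiref.simple_server.make_server from the standard library.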
Now, there are some caveats. Since we update our mirror once a month, it lags slightly behind the real PubChem. Furthermore, as the link will show, aspirin appears to have multiple CAS numbers in PubChem, a problem that Antony has discussed a number of times. Finally, since an arbitrary string might match the CAS regex, the code evaluates the checksum to ensure that the matched string is a valid CAS number (but of course, not necessarily the correct one for the compound).
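The checksum step can be sketched in a few lines of Python. This follows the standard CAS check digit rule (weight the digits before the check digit by 1, 2, 3, ... from right to left, sum them, and take the result modulo 10); it is a sketch of the rule, not the code running on the server, and is_valid_cas is a name chosen for this example.

```python
import re

CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(text):
    """Check both the CAS layout and the trailing check digit."""
    match = CAS_PATTERN.fullmatch(text)
    if match is None:
        return False
    body = match.group(1) + match.group(2)  # all digits except the check digit
    check = int(match.group(3))
    # Rightmost body digit has weight 1, the next one weight 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check


# 50-78-2: 8*1 + 7*2 + 0*3 + 5*4 = 42, and 42 mod 10 = 2.
print(is_valid_cas("50-78-2"))
print(is_valid_cas("50-78-3"))
```

A string that passes this test is well-formed, but that alone says nothing about whether it is the right CAS number for the compound in question.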