The ONSChallenge has been running for some time now and the simple web query form that tied in the data from Google Docs along with web services from IU has turned out to be pretty handy. With more and more data becoming available, I had done some initial exploratory analysis of the measured solubilities. One thing that is useful to the experimentalists is a suggestion of which compound to test next. This could be made on the basis of many factors – availability, ease of synthesis and so on. But one way to look at it is to examine what types of compounds have been tested previously, and suggest that the subsequent compounds be very different from those that have been tested.
A conceptually simple way to do this is to describe molecules numerically, using molecular descriptors such as TPSA, molecular weight, surface area and so on. Having selected a few relevant descriptors, one has a multi-dimensional “chemical space” in which each molecule is a point. One can then visualize the distribution of the molecules in this space using techniques such as multi-dimensional scaling and principal components analysis (PCA), which convert the original high-dimensional space to one of lower dimensionality. One expects that structurally similar molecules will be located near each other in the multi-dimensional chemical space and that this relationship will be largely maintained in the reduced space as well. As a result, one can identify regions of the space (corresponding to structural features) that have been over-represented in previous experiments, as well as structural features that have not received much attention and so would be good to focus on in new experiments.
The idea of exploring and visualizing chemical spaces is not new (Deursen in 2007, Oprea in 2001, Cummins in 1996). Using PCA, one can easily convert the (usually) high dimensional chemical space to a simpler 2D or 3D form. An example is shown here. This was done offline – but it’d be useful to be able to easily generate such plots or even the data required for the plot using simple REST interfaces. That way, one could generate dynamic plots from live data, as is done in the solubility query page.
I put together a simple REST interface to PCA using a combination of numpy and mod_python. The interface does not expose PCA directly; instead it generates the 2D coordinates of compounds in a pre-defined chemical space (ALogP, TPSA, molecular weight and rotatable bond count). In short, the steps are:
- Generate the descriptor matrix for the compounds (specified as SMILES)
- Center the descriptor matrix
- Perform PCA
- Generate score matrix
- Return the first two columns
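The steps above can be sketched with numpy roughly as follows. This is a minimal illustration, not the service's actual code: the function name `pca_scores_2d` is hypothetical, the descriptor values are made-up placeholders rather than computed descriptors, and the descriptor calculation itself (the first step) is assumed to have been done elsewhere. PCA is done here via an SVD of the centered matrix; the service may well use a different route (for example, an eigendecomposition of the covariance matrix).

```python
import numpy as np


def pca_scores_2d(descriptors):
    """Return the first two principal component scores for each row.

    `descriptors` is an (n_molecules x n_descriptors) array-like.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(descriptors, dtype=float)

    # Center each descriptor column on its mean
    Xc = X - X.mean(axis=0)

    # PCA via singular value decomposition: Xc = U * diag(s) * Vt.
    # The rows of Vt are the principal axes (loadings).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Score matrix: projection of the centered data onto the principal axes
    scores = Xc @ Vt.T

    # Keep only the first two columns (PC1 and PC2)
    return scores[:, :2]


# Placeholder descriptor matrix, one row per molecule, with columns
# (ALogP, TPSA, molecular weight, rotatable bond count).
# These numbers are invented for the example, not computed values.
descriptors = [
    [2.0, 0.0, 78.1, 0],
    [3.0, 0.0, 106.2, 1],
    [3.5, 0.0, 120.2, 2],
    [-0.5, 34.1, 58.0, 1],
    [-0.2, 37.3, 60.1, 0],
    [0.3, 37.3, 74.1, 1],
]

coords = pca_scores_2d(descriptors)
for pc1, pc2 in coords:
    print("%.3f\t%.3f" % (pc1, pc2))
```

Note that the matrix is only centered here, as in the list of steps; since molecular weight has a much larger numeric range than the other descriptors, it will tend to dominate the first component unless the columns are also scaled.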
As an example, try
http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/c1ccccc1,c1ccccc1CC,c1ccccc1CCC,C(=O)C(=O),CC(=O)O,CCC(=O)O
The result is a textual representation of the first two principal components. This should be easy to parse in your language of choice, making it simple to plot the first two PCs (using a CGI, the Google Viz API or even importing into Google Docs). Ideally the data would be used to generate annotated plots, so that each point would be linked to the molecule it represents – I haven’t found a way to do this with the Google Viz API.
Of course this is not a complete application – right now it’s not very robust to errors in descriptor calculations. Also, if too many molecules are specified, the web server might not be too happy. More importantly, the descriptor space selected is not necessarily the best or most informative. Selecting a set of descriptors is very important, as one wants to capture structural similarities but also highlight differences. These four descriptors were chosen arbitrarily, but it’s easy to update the code to allow for alternative descriptor spaces. In general, visiting
http://rguha.ath.cx/~rguha/cicc/rest/chemspace
will return a simple XML list of available descriptor spaces. Right now, there’s just one, the default space.
Update
I modified the service so that it will return the first two PCs in different formats depending on the content-type accepted by the client (specified in the Accept header). It looks at the first acceptable content-type and currently handles text/html and text/plain. So if you view the first URL in a browser you get an HTML formatted table. If you use a Python client using urllib2 (which allows you to specify HTTP headers) you can get an HTML table or a chunk of plain text depending on the setting of the Accept header. Note that if no Accept header is specified then text/html is used.
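As a rough sketch of such a client, the snippet below builds a request with an explicit Accept header. It uses `urllib.request`, the Python 3 successor to urllib2, rather than urllib2 itself. The helper names `build_request` and `fetch` are hypothetical, and the code assumes the SMILES can be placed in the URL as-is, as in the example URL above; SMILES containing characters such as `#` would presumably need percent-encoding.

```python
import urllib.request

# Base URL of the default descriptor space, taken from the example above
BASE_URL = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/"


def build_request(smiles_list, accept="text/plain"):
    """Build a request for the 2D coordinates of the given SMILES.

    `accept` sets the Accept header, e.g. "text/plain" or "text/html".
    Hypothetical helper; SMILES are joined with commas and not escaped.
    """
    url = BASE_URL + ",".join(smiles_list)
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch(smiles_list, accept="text/plain"):
    """Send the request and return the response body as text.

    Assumes the service is reachable and replies in UTF-8.
    """
    with urllib.request.urlopen(build_request(smiles_list, accept)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Ask for plain text instead of the default HTML table
    print(fetch(["c1ccccc1", "c1ccccc1CC", "CC(=O)O"], accept="text/plain"))
```

Passing `accept="text/html"` instead should give the same HTML table a browser would see.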
Update 2
The service now supports JSON output. Just ensure that the content-type accepted by the client is application/json.
I guess what we really should be doing is map all compounds in one particular Ugi reagent class (e.g. aldehyde) from ChemSpider, and find those that are most dissimilar in PC1-PC2 space from what has been measured already…
Egon, yes, that’s definitely one way to go about it. I was more thinking of it being used as a sort of map to indicate what compounds had been studied, and which types might deserve some more attention.
Egon – the SolubilitiesSum table has a column for reagent class