So much to do, so little time

Trying to squeeze sense out of chemical data


Extending the REST PCA Service


I recently described a REST-based service for performing PCA-based visualization of chemical spaces. By visiting a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
c1ccccc1,c1ccccc1CC,c1ccccc1CCC,C(=O)C(=O),CC(=O)O

one would get an HTML, plain text or JSON page containing the first two principal components for the molecules specified. With this data one can generate a simple 2D plot of the distribution of the molecules in the “default” chemical space.

However, as Andrew Lang pointed out on FriendFeed, one could use SecondLife to look at 3D versions of the PCA results. So I updated the service to allow one to specify the number of components in the URL. The above form of the service will still work – you get the first two components by default.

To specify more components, use a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/3/mol1,mol2,mol3

where mol1, mol2, mol3, etc. should be valid SMILES strings. The above URL will return the first three PCs. To get just the first PC, replace the 3 with 1, and so on. If more components are requested than are available, all components are returned.
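As a rough illustration of the URL scheme described above, here is a minimal Python sketch that assembles such a URL. The function name `chemspace_url` and its parameters are my own invention for this example, not part of the service, and the sketch only builds the string rather than contacting the server.

```python
BASE_URL = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace"


def chemspace_url(smiles, space="default", n_components=None):
    """Build a request URL for the PCA service (hypothetical helper).

    smiles       -- list of SMILES strings
    space        -- name of the chemical space
    n_components -- number of principal components to request;
                    None omits it, so the service default (two) applies
    """
    if not smiles:
        raise ValueError("at least one SMILES string is required")
    if n_components is not None and n_components < 1:
        raise ValueError("n_components must be a positive integer")

    parts = [BASE_URL, space]
    if n_components is not None:
        parts.append(str(n_components))
    # Molecules are passed as a single comma-separated path segment
    parts.append(",".join(smiles))
    return "/".join(parts)


# Request the first three components for two molecules
print(chemspace_url(["c1ccccc1", "CC(=O)O"], n_components=3))
```

Leaving `n_components` unset reproduces the original two-component form of the URL.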

Currently, the only available space is the “default” space, which is 4-dimensional, so you can get a maximum of four components. In general, visit the URL

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/

to obtain a list of currently available chemical spaces, their names and dimensionality.

Caveat

While it’s easy to get all the components and visualize them, it doesn’t always make sense to do so. In general, one should consider only those initial principal components that explain a significant portion of the variance (see Kaiser’s criterion). The service currently doesn’t provide the eigenvalues, so it’s not really possible to decide whether to go to 3, 4 or more components. For most cases, just looking at the first two principal components will be sufficient – especially given the currently available chemical space.
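To make the caveat concrete, here is a small numpy sketch of how Kaiser’s criterion could be applied if the eigenvalues were available. It assumes the PCA is done on standardized descriptors (i.e. on the correlation matrix), in which case the criterion keeps components whose eigenvalue exceeds 1. The data below is synthetic and is not the service’s “default” space; `kaiser_components` is a name made up for this example.

```python
import numpy as np


def kaiser_components(X):
    """Return (eigenvalues in descending order, number with eigenvalue > 1).

    X is an (n_samples, n_descriptors) array. The eigenvalues are those of
    the descriptor correlation matrix, so they sum to n_descriptors.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    # eigvalsh returns ascending eigenvalues for a symmetric matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]
    n_keep = int(np.sum(eigvals > 1.0))
    return eigvals, n_keep


# Synthetic example: four descriptors built from two latent factors,
# so only two components should carry most of the variance
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 4))
X = np.column_stack(
    [latent[:, 0], latent[:, 0], latent[:, 1], latent[:, 1]]
) + noise

eigvals, n_keep = kaiser_components(X)
explained = eigvals / eigvals.sum()

print("eigenvalues:", np.round(eigvals, 3))
print("fraction of variance explained:", np.round(explained, 3))
print("components retained by Kaiser's criterion:", n_keep)
```

The explained-variance fractions give a second view of the same decision: components beyond the retained ones contribute very little here.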

Update (Jan 13, 2009)

Since the descriptor service now requires Base64-encoded SMILES, the example usage URL is now invalid. Instead, the SMILES should be replaced by their encoded versions. In other words, the first URL above becomes

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
YzFjY2NjYzE=,YzFjY2NjYzFDQw==,YzFjY2NjYzFDQ0M=,
Qyg9TylDKD1PKQ==,Q0MoPU8pTw==
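The encoded URL can be generated with Python’s standard `base64` module, as in the sketch below. The helper names `encode_smiles` and `encoded_chemspace_url` are made up for this example. The standard Base64 alphabet is assumed because it reproduces the encoded strings shown above; whether the service also accepts the URL-safe variant is not something I have checked.

```python
import base64

BASE_URL = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace"


def encode_smiles(smiles):
    """Base64-encode a single SMILES string (standard alphabet)."""
    return base64.b64encode(smiles.encode("ascii")).decode("ascii")


def encoded_chemspace_url(smiles_list, space="default"):
    """Build the service URL with each SMILES Base64-encoded (hypothetical helper)."""
    encoded = ",".join(encode_smiles(s) for s in smiles_list)
    return f"{BASE_URL}/{space}/{encoded}"


molecules = ["c1ccccc1", "c1ccccc1CC", "c1ccccc1CCC", "C(=O)C(=O)", "CC(=O)O"]
print(encoded_chemspace_url(molecules))
```

Each molecule is encoded separately and the results are then joined with commas, so the commas themselves are never encoded.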

Written by Rajarshi Guha

January 3rd, 2009 at 1:14 am

The ONS Challenge & Visualizing Chemical Space


The ONSChallenge has been running for some time now and the simple web query form that tied in the data from Google Docs along with web services from IU has turned out to be pretty handy. With more and more data becoming available, I had done some initial exploratory analysis of the measured solubilities. One thing that is useful to the experimentalists is a suggestion of which compound to test next. This could be made on the basis of many factors – availability, ease of synthesis and so on. But one way to look at it is to examine what types of compounds have been tested previously, and suggest that the subsequent compounds be very different from those that have been tested.


Written by Rajarshi Guha

December 30th, 2008 at 6:04 pm