# So much to do, so little time

Trying to squeeze sense out of chemical data

## Playing with REST Descriptor Services

As part of my work at IU I have been implementing a number of cheminformatics web services. Initially these were SOAP, but I realized that REST interfaces make life much easier. (also see here) As a result, a number of these services have simple REST interfaces. One such service provides molecular descriptor calculations, using the CDK as the backend. Thus by visiting  (i.e., making a HTTP GET request) a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/CC(=O)

you get a simple XML document containing a list of URL’s. Each URL represents a specific “resource”. In this context, the resource is the descriptor values for the given molecule. Thus by visiting

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/CC(=O)C

one gets another simple XML document that lists the names and values of the AlogP descriptor. In this case, the CDK implementation evaluates AlogP, AlogP2 and molar refractivity – so there are actually three descriptor values. On the other hand something like the  molecular weight descriptor gives a single value. To just see the list of available descriptors visit

http://www.chembiogrid.org/cheminfo/rest/desc/descriptors

which gives an XML document containing a series of links. Visiting one of these links gives the “descriptor specification” – information on the vendor, version, reference to a descriptor ontology and so on.

(I should point out that the descriptors available in this service are from a pretty old version of the CDK. I really should update the descriptors to the 1.2.x versions)

### Applications

This type of interface makes it easy to whip up various applications. One example is the PCA analysis of compound collections. Another one I put together today based on a conversation with Jean-Claude was a simple application to plot pairs of descriptor values for a collection of SMILES.

The app is pretty simple (and quite slow, since it uses synchronous GET’s to the descriptor service for each SMILES and has to make two calls for each SMILES – hey, it was a quick hack!). Currently, it’s a bit restrictive – if a descriptor calculates multiple values, it will only use the first value. To see how many values a molecular descriptor calculates, see the list here.

With a little more effort one could easily have a pretty nice online descriptor calculation application rivaling a standalone application such as the the CDK descriptor GUI

Also,if you struggle with nice CSS layouts, the CSS Layout Collection is a fantastic resource. And jQuery rocks.

Written by Rajarshi Guha

January 7th, 2009 at 7:06 am

## Quick Comments on an Analysis of Antithrombotics

Joerg has made a nice blog post on the use of Open Source software and data to analyse the occurence of antithrombotics. More specifically he was trying to answer the question

Which XRay ligands are closest to the Fontaine et al. structure-activity relationship data for allowing structure-based drug design?

Using Blue Obelisk tools and ChemSpider and where Fontaine et al. refers to the Fontaine Factor Xa dataset. You should read his post for a nice analysis of the problem. I just wanted to consider two points he had raised.

Written by Rajarshi Guha

January 5th, 2009 at 1:36 am

## Extending the REST PCA Service

I recently described a REST based service for performing PCA-based visualization of chemical spaces. By visiting a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
c1ccccc1,c1ccccc1CC,c1ccccc1CCC,C(=O)C(=O),CC(=O)O

one would get a HTML, plain text or JSON page containing the first two principal components for the molecules specified. With this data one can generate a simple 2D plot of the distributions of molecules in the “default” chemical space.

However, as Andrew Lang pointed out on FriendFeed, one could use SecondLife to look at 3D versions of the PCA results. So I updatesd the service to allow one to specify the number of components in the URL. The above form of the service will still work – you get the first two components by default.

To specify more components use an URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/3/mol1,mol2,mol3

where mol1, mol2, mol3 etc should be valid SMILES strings. The above URL will return the first three PC’s. To get just the first PC, replace the 3 with 1 and so on. If more components are requested than available, all components are returned.

Currently, the only available space is the “default” space which is 4-dimensional, so you can get a maximum of four components. In general, visit the URL

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/

to obtain a list of currently available chemical spaces, their names and dimensionality.

### Caveat

While it’s easy to get all the components and visualize them, it doesn’t always make sense to do so. In general, one should consider those initial principal components that explain a significant portion of the variance (see Kaisers criterion). The service currently doesn’t provide the eigenvalues, so it’s not really possible to decide whether to go to 3, 4 or more components. For most cases, just looking at the first two principal components will sufficient – especially given the currently available chemical space.

### Update (Jan 13, 2009)

Since the descriptor service now requires that Base64 encoded SMILES, the example usage URL is now invalid. Instead, the SMILES should be replaced by their encoded versions. In other words the first URL above becomes

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
YzFjY2NjYzE=,YzFjY2NjYzFDQw==,YzFjY2NjYzFDQ0M=,
Qyg9TylDKD1PKQ==,Q0MoPU8pTw==

Written by Rajarshi Guha

January 3rd, 2009 at 1:14 am