# So much to do, so little time

Trying to squeeze sense out of chemical data

## Consolidating Services

Over the last few months I’ve been writing about REST services that I’ve made available. While useful, they’ve been a hassle to maintain. The main reason is that many of the Python services (such as the depiction and descriptor services) are actually front-ends to SOAP services written in Java using the CDK. Thus, I have to maintain two layers of code. While deploying services using SOAP is easy (the service code is just plain Java classes and methods deployed in a Tomcat container), writing clients is not as cross-platform as SOAP aficionados make it out to be. For example, Python clients require ZSI, which is rather complex for simple services and, last time I looked, did not support certain SOAP types. While SOAPpy is an easy-to-use option, it is no longer maintained. Furthermore, SOAP services on their own don’t necessarily provide a RESTful interface (hence the Python front-ends).

Some time back I came across a post by Rich where he mentioned Restlet. The idea is that this package makes writing RESTful services in Java easier than going through the full-blown Servlet API. Indeed, it took a day or two to refactor my SOAP-based services into RESTful services using Restlet. One of the nice things about the package is that while it allows you to integrate RESTful services into a Tomcat container (say, for proxying purposes), you can also deploy the services via a small web server included in the package.

What is the result of all this? The short answer is that all the CDK SOAP services that I had are now packaged into a single JAR file that can be run from the command line:

java -jar cdkrest.jar -p 6666 -s rguha.ath.cx -l services.log

You can then access services such as http://rguha.ath.cx:6666/cdk/depict/CC=O. So there’s no need to run a Tomcat instance. Furthermore, by using Restlet, the services present a RESTful interface without requiring me to write an extra layer of Python. This also allows for easy distribution – just download the JAR file and you’re all set.
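As a rough illustration of what a client of the standalone server might look like, here is a minimal Python sketch. The host, port and `/cdk/depict/` path are taken from the example above; the function names are made up for this example, and the assumption that the server percent-decodes the last path segment is mine, not something the service documentation is known to guarantee.

```python
from urllib.parse import quote
from urllib.request import urlopen


def depict_url(smiles: str, host: str = "rguha.ath.cx", port: int = 6666) -> str:
    """Build a depiction URL of the form shown in the post."""
    # Assumption: the server percent-decodes the final path segment.
    # "=" is left unescaped so simple SMILES match the URL given above.
    return f"http://{host}:{port}/cdk/depict/{quote(smiles, safe='=')}"


def fetch_depiction(smiles: str, **kwargs) -> bytes:
    """Issue a plain HTTP GET and return the raw response body."""
    with urlopen(depict_url(smiles, **kwargs), timeout=10) as response:
        return response.read()


print(depict_url("CC=O"))
# http://rguha.ath.cx:6666/cdk/depict/CC=O
```

Calling `fetch_depiction("CC=O", host="localhost")` would target a locally started copy of the JAR, assuming it was launched with the same port.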

Along with refactoring code for the new Restlet-based approach, I’ve been documenting the services as I go along, which is better than having the documentation scattered over multiple blog posts. The new location for all the CDK services is http://rest.rguha.net, which includes downloads and documentation. Note that this site will not actually run the services; see the site for more details.

Over time other services that are not necessarily CDK specific (such as database services) will also be migrated (in fact some of the old services are simply redirected to the new service), though certain services, such as predict, will remain as a Python service.

Written by Rajarshi Guha

February 9th, 2009 at 11:43 pm

Posted in software,cheminformatics


## Deploying Predictive Models

Over the past few days I’ve been developing some predictive models in R for the solubility data being generated as part of the ONS Solubility Challenge. As I develop the models I put up a brief summary of the results on the wiki. In the end, however, we’d like to use these models to predict the solubility of untested compounds. While anybody can send me a SMILES string and get back a prediction, it’s more useful (and less work for me!) if users can do it themselves. This requires that the models be deployed and made available as a web page or a service. Last year I developed a series of statistical web services based on R. The services were written in Java and are described in this paper. Since I’m working more with REST services these days, I wanted to see how easy it’d be to develop a model deployment system using Python, thus avoiding a multi-tiered system. With the help of rpy2, it turns out that this wasn’t very difficult.


Written by Rajarshi Guha

January 14th, 2009 at 9:23 pm

## Update to the REST Descriptor Services

The REST interface to the CDK descriptors has so far allowed one to access descriptor values for a SMILES string by simply appending it to a URL, resulting in something like

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/c1ccccc1COCC

This type of URL is pretty handy to construct by hand. However, as Pat Walters pointed out in the comments to that post, SMILES containing ‘#’ will cause problems, since that character marks the start of a URL fragment identifier. Furthermore, the presence of a ‘/’ in a SMILES string necessitates some processing in the service to recognize it as part of the SMILES rather than as a URL path separator. While the service could handle these (at the expense of messy code), it turned out that there were subtle bugs.
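Both problems are easy to see with Python’s standard URL parser. The two SMILES used here (a nitrile and a stereo-specified alkene) are just illustrative inputs I picked, not examples from the original discussion.

```python
from urllib.parse import urlsplit

BASE = "http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/"

# A triple bond: everything after "#" is treated as a fragment, which
# clients normally do not send to the server at all.
nitrile = urlsplit(BASE + "CC#N")
print(nitrile.path)      # /~rguha/cicc/rest/desc/descriptors/CC
print(nitrile.fragment)  # N

# Double bond stereochemistry: each "/" becomes a path separator, so the
# SMILES is spread over several path segments.
alkene = urlsplit(BASE + "F/C=C/F")
print(alkene.path.split("/")[-3:])  # ['F', 'C=C', 'F']
```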

Based on Pat’s suggestion I converted the service to use base64 encoded SMILES, which let me simplify the code and remove the bugs. As a result, one can no longer append the SMILES directly to the URLs. Instead the above URL would be rewritten in the form

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/YzFjY2NjYzFDT0ND

All the example URLs described in my previous post that involve SMILES strings should be rewritten using base64 encoded SMILES. So to get a document listing all descriptors for “c1ccccc1COCC” one would write

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/YzFjY2NjYzFDT0ND

While this makes it a little harder to write out these URLs by hand, I expect that most uses of this service will be programmatic, in which case getting base64 encoded SMILES is trivial.
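For instance, a minimal Python sketch of the encoding step might look like the following. The helper names are hypothetical. The standard base64 alphabet reproduces the encoded string shown above, but it can emit “+” and “/” for other inputs; whether the service expects the URL-safe variant in those cases is not something this post states, so treat that as an open assumption.

```python
import base64
from typing import Optional

BASE = "http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors"


def encode_smiles(smiles: str) -> str:
    """Base64-encode a SMILES string using the standard alphabet."""
    return base64.b64encode(smiles.encode("ascii")).decode("ascii")


def descriptor_url(smiles: str, descriptor: Optional[str] = None) -> str:
    """URL listing all descriptors for a molecule, or one named descriptor."""
    parts = [BASE]
    if descriptor is not None:
        parts.append(descriptor)
    parts.append(encode_smiles(smiles))
    return "/".join(parts)


print(encode_smiles("c1ccccc1COCC"))
# YzFjY2NjYzFDT0ND
print(descriptor_url("c1ccccc1COCC"))
# http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/YzFjY2NjYzFDT0ND
```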

Written by Rajarshi Guha

January 11th, 2009 at 5:52 pm

## Playing with REST Descriptor Services

As part of my work at IU I have been implementing a number of cheminformatics web services. Initially these were SOAP, but I realized that REST interfaces make life much easier (also see here). As a result, a number of these services have simple REST interfaces. One such service provides molecular descriptor calculations, using the CDK as the backend. Thus by visiting (i.e., making an HTTP GET request to) a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/CC(=O)

you get a simple XML document containing a list of URLs. Each URL represents a specific “resource”. In this context, the resource is the descriptor values for the given molecule. Thus by visiting

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/CC(=O)C

one gets another simple XML document that lists the names and values of the AlogP descriptor. In this case, the CDK implementation evaluates AlogP, AlogP2 and molar refractivity, so there are actually three descriptor values. On the other hand, something like the molecular weight descriptor gives a single value. To just see the list of available descriptors visit

http://www.chembiogrid.org/cheminfo/rest/desc/descriptors

which gives an XML document containing a series of links. Visiting one of these links gives the “descriptor specification” – information on the vendor, version, reference to a descriptor ontology and so on.

(I should point out that the descriptors available in this service are from a pretty old version of the CDK. I really should update the descriptors to the 1.2.x versions.)

### Applications

This type of interface makes it easy to whip up various applications. One example is the PCA analysis of compound collections. Another one I put together today based on a conversation with Jean-Claude was a simple application to plot pairs of descriptor values for a collection of SMILES.

The app is pretty simple (and quite slow, since it uses synchronous GETs to the descriptor service and has to make two calls for each SMILES. Hey, it was a quick hack!). Currently, it’s a bit restrictive: if a descriptor calculates multiple values, it will only use the first value. To see how many values a molecular descriptor calculates, see the list here.
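The logic boils down to something like the Python sketch below. This is not the app’s actual code; it only mirrors the behaviour described above. The `get_values` callable is a hypothetical stand-in for “GET the descriptor URL and parse the values out of the XML”, and the fake implementation returns made-up numbers so the example runs without the service.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical signature: (descriptor name, SMILES) -> descriptor values.
ValueFetcher = Callable[[str, str], Sequence[float]]


def descriptor_pairs(
    smiles_list: Iterable[str],
    desc_x: str,
    desc_y: str,
    get_values: ValueFetcher,
) -> List[Tuple[float, float]]:
    """Collect one (x, y) point per SMILES, one call per descriptor."""
    points = []
    for smiles in smiles_list:
        x_values = get_values(desc_x, smiles)  # first call for this SMILES
        y_values = get_values(desc_y, smiles)  # second call for this SMILES
        # Only the first value of a multi-valued descriptor is used.
        points.append((x_values[0], y_values[0]))
    return points


def fake_get_values(descriptor: str, smiles: str) -> Sequence[float]:
    # Stand-in for the real HTTP GET and XML parsing; values are made up.
    if descriptor == "X":
        return [float(len(smiles)), 0.0]
    return [float(smiles.count("C"))]


print(descriptor_pairs(["CCO", "CC(=O)C"], "X", "Y", fake_get_values))
# [(3.0, 2.0), (7.0, 3.0)]
```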

With a little more effort one could easily have a pretty nice online descriptor calculation application rivaling a standalone application such as the CDK descriptor GUI.

Also, if you struggle with nice CSS layouts, the CSS Layout Collection is a fantastic resource. And jQuery rocks.

Written by Rajarshi Guha

January 7th, 2009 at 7:06 am