Faster Maximum Common Substructures

Recently Syed Asad Rahman of the EBI sent around a message regarding some new code that he has been developing to perform maximum common substructure (MCSS) detection. It employs the CDK but also includes some alternative methods, which allow it to handle cases for which the CDK alone takes forever. An example is determining the MCSS between the two molecules below. The new code from Syed detects the MCSS (though it takes a few seconds), whereas the CDK does not return after 1 minute.

c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CN(CC)CC
c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CNCCNC
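
For reference, this is roughly what an attempt using the CDK alone looks like. It's only a sketch: I'm going through UniversalIsomorphismTester.getOverlaps(), and the exact class and method signatures have shifted between CDK releases, so treat the details below as assumptions rather than a drop-in recipe.

import java.util.List;

import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.isomorphism.UniversalIsomorphismTester;
import org.openscience.cdk.smiles.SmilesParser;

public class McssExample {
    public static void main(String[] args) throws Exception {
        SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
        IAtomContainer mol1 = sp.parseSmiles(
            "c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CN(CC)CC");
        IAtomContainer mol2 = sp.parseSmiles(
            "c1cc(c(cc1)N1CCN(CC1)C(=O)C1C(CN(C1)C(C)C)c1ccc(cc1)Cl)CNCCNC");

        // getOverlaps() enumerates the common substructures of the two graphs;
        // on this particular pair it does not return in a reasonable time
        List overlaps = UniversalIsomorphismTester.getOverlaps(mol1, mol2);
        System.out.println("Number of overlaps found: " + overlaps.size());
    }
}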

While the sources haven’t been released yet, he has made a JAR file available along with some example code. It’s still in beta, so there are some rough edges to the API as well as some edge cases that haven’t been caught. But overall, it seems to do a pretty good job on all the test cases I’ve thrown at it. To put it through its paces I put together a simple GUI that allows visual inspection of the resultant MCSSs. The download is large because I bundled the entire CDK with it. Also, the substructure rendering is a little crude, but it does the job.

The Quest for Universal Descriptors – Not There Yet

A major component of QSAR modeling is the choice of molecular descriptors used in a model. The literature is replete with descriptors, and there’s lots of software (commercial and open source) to calculate them. There are many issues related to molecular descriptors (such as many of them being correlated with each other), but I came across a paper by Frank Burden and co-workers describing a “universal descriptor”. What is such a descriptor?

The idea derives from the fact that molecular descriptors usually characterize one specific structural feature. But in many cases, the biological activity of a molecule is a function of multiple structural features. This implies that you need multiple descriptors to capture the entire structure-activity relationship. The goal of a universal descriptor set is that it should characterize a molecular structure in such a way that it (implicitly or explicitly) encodes all the structural features that might be relevant for the molecule’s activity in multiple, diverse scenarios. In other words, a true universal descriptor set could be used in a variety of QSAR models without requiring additional descriptors.

One might ask whether this is feasible at all. But when we realize that in many cases biological activity is controlled by shape and electrostatics, it makes sense that a descriptor characterizing these two features simultaneously should be a good candidate. Burden et al. describe “charge fingerprints”, which are claimed to be a step towards such a universal descriptor set.

These descriptors are essentially binned counts of partial charges on specific atoms. The method considers 7 atom types (H, C, N, O, P, S, Si) and declares 3 bins for each. Then, for a given molecule, one simply bins the partial charges on the atoms. This results in an 18-element descriptor vector which can then be used in QSAR modeling. This is a very simple descriptor to implement (the authors’ implementation is commercially available, as far as I can see). They test it out on several large and diverse datasets and also compare these descriptors to atom count descriptors and BCUTs.
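
Since the scheme is so simple, here is a rough sketch of the binning step. The element list follows the paper, but the bin edges (and hence the exact vector length) are placeholders I made up for illustration: the paper’s cutoffs were chosen empirically, and the partial charges would of course come from an actual charge calculation rather than being typed in by hand.

import java.util.Arrays;
import java.util.List;

public class ChargeFingerprint {

    // The seven element types considered by the scheme
    private static final List<String> ELEMENTS =
        Arrays.asList("H", "C", "N", "O", "P", "S", "Si");

    // Hypothetical bin edges: (-inf, -0.2), [-0.2, 0.2), [0.2, +inf)
    private static final double LOW = -0.2, HIGH = 0.2;

    // Count, for each element type, how many atoms fall into each charge bin
    public static int[] fingerprint(String[] symbols, double[] charges) {
        int[] counts = new int[3 * ELEMENTS.size()];
        for (int i = 0; i < symbols.length; i++) {
            int e = ELEMENTS.indexOf(symbols[i]);
            if (e == -1) continue; // element not covered by the scheme
            int bin = charges[i] < LOW ? 0 : (charges[i] < HIGH ? 1 : 2);
            counts[3 * e + bin]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy input: atom symbols and invented partial charges for a small molecule
        String[] symbols = {"C", "C", "O", "H", "H"};
        double[] charges = {0.05, 0.31, -0.45, 0.12, 0.12};
        System.out.println(Arrays.toString(fingerprint(symbols, charges)));
    }
}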

The results indicate that while the charge fingerprints are similar in performance to descriptors like BCUTs, in the end combinations of the charge fingerprints with other descriptors perform best. OK, so that seems to preclude the charge fingerprints from being universal in nature. The fact that the number of bins is an empirical choice based on the datasets they employed also seems like a factor that prevents them from being universal descriptors. And shape isn’t considered. Given this point, it would have been interesting to see how these descriptors compared to CPSAs. So while simple, interpretable and useful, it’s not clear why these would be considered universal.

Consolidating Services

Over the last few months I’ve been writing about REST services that I’ve made available. While useful, they’ve been a hassle to maintain. The main reason is that many of the Python services (such as the depiction and descriptor services) are actually front-ends to SOAP services written in Java using the CDK. Thus, I have to maintain two layers of code. While deploying services using SOAP is easy (the service code is just plain Java classes and methods deployed in a Tomcat container), writing clients is not as cross-platform as SOAP aficionados make it out to be. For example, Python clients require ZSI, which is rather complex for simple services and, last time I looked, does not support certain SOAP types. While SOAPpy is an easy-to-use option, it is no longer maintained. Furthermore, SOAP services on their own don’t necessarily provide a RESTful interface (hence the Python front-ends).

Some time back I came across a post by Rich where he mentioned Restlet. The idea is that this package makes writing RESTful services in Java easier than going through the full-blown Servlet API. Indeed, it took only a day or two to refactor my SOAP-based services into RESTful services using Restlet. One of the nice things about the package is that while it allows you to integrate RESTful services into a Tomcat container (say, for proxying purposes), you can also deploy the services via a small web server included in the package.
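
To give a flavor of why this is convenient, here’s a minimal sketch of a Restlet service running on the embedded HTTP server. It follows the 2.x style of the Restlet API (older releases lay the packages out differently), and the /cdk/echo/{smiles} route is purely hypothetical, serving only to show how a URL template maps onto a resource class.

import org.restlet.Application;
import org.restlet.Component;
import org.restlet.Restlet;
import org.restlet.data.Protocol;
import org.restlet.resource.Get;
import org.restlet.resource.ServerResource;
import org.restlet.routing.Router;

public class MiniCdkService extends Application {

    // Resource bound to a URL template; the {smiles} segment shows up as a request attribute
    public static class EchoResource extends ServerResource {
        @Get
        public String represent() {
            String smiles = (String) getRequest().getAttributes().get("smiles");
            return "You sent: " + smiles;
        }
    }

    @Override
    public Restlet createInboundRoot() {
        Router router = new Router(getContext());
        router.attach("/cdk/echo/{smiles}", EchoResource.class);
        return router;
    }

    public static void main(String[] args) throws Exception {
        // Embedded HTTP server included with Restlet, so no Tomcat is needed
        Component component = new Component();
        component.getServers().add(Protocol.HTTP, 8182);
        component.getDefaultHost().attach(new MiniCdkService());
        component.start();
    }
}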

What is the result of all this? The short answer is that all the CDK SOAP services that I had are now packaged into a single JAR file that can be run from the command line:

java -jar cdkrest.jar -p 6666 -s rguha.ath.cx -l services.log

You can then access services such as http://rguha.ath.cx:6666/cdk/depict/CC=O. So there’s no need to run a Tomcat instance. Furthermore, by using Restlet, the services present a RESTful interface without requiring me to write an extra layer of Python. This also allows for easy distribution – just download the JAR file and you’re all set.

Along with refactoring the code for the new Restlet-based approach, I’ve been documenting the services as I go along, which is better than having the documentation scattered over multiple blog posts. The new location for all the CDK services is http://rest.rguha.net, which includes downloads and documentation. Note that this site does not actually run the services – see the site for more details.

Over time, other services that are not necessarily CDK-specific (such as database services) will also be migrated (in fact, some of the old services are simply redirected to the new service), though certain services, such as predict, will remain as Python services.

Substructure Matching, REST style

I’ve been putting up a number of REST services for a variety of cheminformatics tasks. One that was missing was substructure searching. In many scenarios it’s useful to be able to check whether a target molecule contains a query substructure. This can now be done by visiting URLs of the form

http://rguha.ath.cx/~rguha/cicc/rest/substruct/TARGET/QUERY

where TARGET and QUERY are a SMILES string and a SMARTS (or SMILES) pattern respectively (appropriately escaped). If the query pattern is found in the target molecule, the resultant page contains the string “true”; otherwise it contains the string “false”. The service uses OpenBabel to perform the SMARTS matching.
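
A quick client sketch: escape the SMILES and SMARTS, hit the URL, and read back “true” or “false”. Only the URL layout comes from the description above; the molecules and the rest of the scaffolding are illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SubstructClient {
    public static void main(String[] args) throws Exception {
        String target = "CC(=O)Oc1ccccc1C(=O)O";  // aspirin, as the target SMILES
        String query = "c1ccccc1";                // a benzene ring as the query pattern

        String url = "http://rguha.ath.cx/~rguha/cicc/rest/substruct/"
            + URLEncoder.encode(target, "UTF-8") + "/"
            + URLEncoder.encode(query, "UTF-8");

        // The response body is simply the string "true" or "false"
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new URL(url).openStream()));
        System.out.println(reader.readLine());
        reader.close();
    }
}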

Using this service, I updated the ONS data query page to allow one to filter results by SMARTS patterns. This generally only makes sense when no specific solute is selected. However, filtering all the entries in the spreadsheet (i.e., any solvent, any solute) can be slow, since each molecule is matched against the SMARTS pattern using a separate HTTP request. This could be easily fixed using POST, but it’s a hack anyway since this type of thing should probably be done in the database (i.e., Google Spreadsheet).

Update

The substructure search service has now been updated to accept POST requests. As a result, it is possible to send in multiple SMILES strings and match them against a pattern all in one go. See the repository for a description of how to use the POST method. (The GET method is still supported, but it only matches a pattern against a single target SMILES.) Consequently, querying the ONS data using SMARTS patterns is significantly faster.
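
As a rough illustration of the POST usage, a client might look something like the sketch below. Note that the form field names (“pattern” and “target”) are guesses on my part, not the actual interface; the repository documentation describes the real parameter names.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SubstructPostClient {
    public static void main(String[] args) throws Exception {
        String[] targets = {"CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1CCN", "CCCC"};
        String pattern = "c1ccccc1";

        // "pattern" and "target" are hypothetical field names used only for illustration
        StringBuilder body = new StringBuilder("pattern=")
            .append(URLEncoder.encode(pattern, "UTF-8"));
        for (String t : targets)
            body.append("&target=").append(URLEncoder.encode(t, "UTF-8"));

        URL url = new URL("http://rguha.ath.cx/~rguha/cicc/rest/substruct/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream());
        out.write(body.toString());
        out.close();

        // Print whatever the service returns
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null)
            System.out.println(line);
        reader.close();
    }
}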

Papers About Systems You Can’t Use or Buy

Browsing the latest articles in JCIM, I came across one by Sander et al. that discussed the design of a drug discovery informatics system employed at Actelion. The main claim to fame of the work appears to be that the system was built from scratch and so is vendor-independent.

While the paper is somewhat interesting, one question jumped out at me: what is its value to a reader? The system is specific to this one company, so it’s not as if I can access the code or the workflows. I can’t even buy the software. In this case the authors do provide public access to some of their tools, so it’s not a totally “opaque” paper. But there are other examples where one cannot really even try out the tools described in the paper.

While I see the value of the paper from the authors’ point of view (spreading the word, publications being “currency”, etc.), such papers have always felt a little pointless to me as a reader. What can I do after reading this paper? Is there anything I can follow up on?