So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘descriptors’ tag

CDKDescUI Updates – DnD & Batch Mode


I’ve put out an updated version (1.0.1) of the CDK descriptor calculator that now supports drag ‘n drop of the input file – just drag an appropriate file onto the UI and the input file text field should be automatically populated. In addition, all file dialogs let OS X users specify a file name manually.

The current version also supports a frequently requested command line batch mode. It’s a little limited compared to the GUI: you can’t specify individual descriptors, only descriptor categories (such as ‘all’, ‘topological’, etc.), and the only output format is tab delimited.

$ java -jar CDKDescUI.jar -h

usage: cdkdescui [OPTIONS] inputfile
 -b    Batch mode
 -h    Help
 -o    Output file
 -t    Descriptor type: all, topological, geometric, constitutional,
       electronic, hybrid
 -v    Verbose output

CDKDescUI v1.0.1 Rajarshi Guha <>

By default, output is written to output.txt and all descriptors are evaluated. If errors occur for a given molecule and descriptor, they are reported at the end (i.e., the program continues rather than stopping at the first failure).
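For scripted runs, the batch invocation can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the help text above; it assumes that `-t` and `-o` take their value as the next argument and that the input file comes last, as the usage line suggests. The function name and the file name `molecules.sdf` are hypothetical.

```python
# Descriptor categories listed in the help output above.
DESCRIPTOR_TYPES = {
    "all", "topological", "geometric",
    "constitutional", "electronic", "hybrid",
}


def build_batch_command(input_file, output_file="output.txt",
                        descriptor_type="all", verbose=False,
                        jar="CDKDescUI.jar"):
    """Return the argument list for a batch-mode run (hypothetical helper).

    Assumes -t and -o are followed by their value and that the input
    file is the final positional argument.
    """
    if descriptor_type not in DESCRIPTOR_TYPES:
        raise ValueError(f"unknown descriptor type: {descriptor_type!r}")
    cmd = ["java", "-jar", jar, "-b",
           "-t", descriptor_type, "-o", output_file]
    if verbose:
        cmd.append("-v")
    cmd.append(input_file)
    return cmd


cmd = build_batch_command("molecules.sdf", descriptor_type="topological")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.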

Written by Rajarshi Guha

April 4th, 2010 at 5:01 pm

Posted in software


The Quest for Universal Descriptors – Not There Yet


A major component of QSAR modeling is the choice of molecular descriptors used in a model. The literature is replete with descriptors and there’s lots of software (commercial and open source) to calculate them. There are many issues related to molecular descriptors (such as many of them being correlated), but I came across a paper by Frank Burden and co-workers describing a “universal descriptor”. What is such a descriptor?

The idea derives from the fact that molecular descriptors usually characterize one specific structural feature. But in many cases, the biological activity of a molecule is a function of multiple structural features. This implies that you need multiple descriptors to capture the entire structure-activity relationship. The goal of a universal descriptor set is to characterize a molecular structure in such a way that it (implicitly or explicitly) encodes all the structural features that might be relevant for the molecule’s activity in multiple, diverse scenarios. In other words, a true universal descriptor set could be used in a variety of QSAR models without requiring additional descriptors.

One might ask whether this is feasible. But when we realize that in many cases biological activity is controlled by shape and electrostatics, it makes sense that a descriptor that characterizes these two features simultaneously would be a good candidate. Burden et al. describe “charge fingerprints”, which are claimed to be a step towards such a universal descriptor set.

These descriptors are essentially binned counts of partial charges on specific atoms. The method considers 7 elements (H, C, N, O, P, S, Si) and for each element declares 3 bins. Then for a given molecule, one simply bins the partial charges on the atoms. This results in an 18-element descriptor vector which can then be used in QSAR modeling. This is a very simple descriptor to implement (the authors’ implementation is commercially available, as far as I can see). They test it out on several large and diverse datasets and also compare these descriptors to atom count descriptors and BCUTs.
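The binning idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors’ implementation: the bin edges below are made-up values, the function name is hypothetical, and the partial charges are assumed to have been computed elsewhere. Note that seven elements times three bins gives 21 entries in this sketch rather than 18, so the published layout evidently differs in some detail.

```python
from bisect import bisect_right
from collections import Counter

# Element order follows the list in the post. The bin edges are
# illustrative assumptions, NOT the values used by Burden et al.
ELEMENTS = ("H", "C", "N", "O", "P", "S", "Si")
BIN_EDGES = (-0.1, 0.1)  # two edges -> three bins: low, middle, high


def charge_fingerprint(atoms, elements=ELEMENTS, edges=BIN_EDGES):
    """Count atoms per (element, charge bin) and return a flat vector.

    `atoms` is an iterable of (element_symbol, partial_charge) pairs.
    Atoms of elements not listed in `elements` are ignored.
    """
    n_bins = len(edges) + 1
    counts = Counter()
    for symbol, charge in atoms:
        if symbol in elements:
            # bisect_right gives the index of the bin the charge falls in.
            counts[(symbol, bisect_right(edges, charge))] += 1
    # Flatten in a fixed order so vectors are comparable across molecules.
    return [counts[(e, b)] for e in elements for b in range(n_bins)]


# Made-up partial charges for a methanol-like molecule.
atoms = [("C", 0.05), ("O", -0.40), ("H", 0.21),
         ("H", 0.02), ("H", 0.02), ("H", 0.02)]
fingerprint = charge_fingerprint(atoms)
print(len(fingerprint), fingerprint)
```

Because the vector has a fixed layout (element by element, bin by bin), fingerprints for different molecules can be stacked directly into a descriptor matrix.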

The results indicate that while the charge fingerprints perform similarly to things like BCUTs, in the end combinations of these fingerprints with other descriptors perform best. OK, so that seems to preclude the charge fingerprints being universal in nature. The fact that the number of bins is an empirical choice based on the datasets they employed also seems like a factor that prevents them from being universal descriptors. And shape isn’t considered. Given this point, it would have been interesting to see how these descriptors compared to CPSAs. So while simple, interpretable and useful, it’s not clear why these would be considered universal.

Written by Rajarshi Guha

February 14th, 2009 at 1:09 am

Posted in Literature


Update to the REST Descriptor Services


Until now, the REST interface to the CDK descriptors allowed one to access descriptor values for a SMILES string by simply appending it to a URL, resulting in something like

This type of URL is pretty handy to construct by hand. However, as Pat Walters pointed out in the comments to that post, SMILES containing ‘#’ will cause problems since that character introduces a URL fragment. Furthermore, the presence of a ‘/’ in a SMILES string necessitates some processing in the service to recognize it as part of the SMILES rather than as a URL path separator. While the service could handle these (at the expense of messy code), it turned out that there were subtle bugs.
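Both problems are easy to reproduce with a standard URL parser. The sketch below uses Python’s `urllib.parse` and a placeholder base URL (`http://example.org/descriptors/` is hypothetical, not the service’s address) to show what happens when a raw SMILES is appended.

```python
from urllib.parse import urlsplit

# Placeholder base URL, NOT the real service address.
BASE = "http://example.org/descriptors/"


def split_raw_smiles_url(smiles):
    """Append a raw SMILES to BASE and report how the URL is parsed."""
    parts = urlsplit(BASE + smiles)
    segments = [s for s in parts.path.split("/") if s]
    return segments, parts.fragment


# Acetonitrile: everything after '#' is treated as a fragment, which a
# client normally does not even send to the server.
print(split_raw_smiles_url("CC#N"))

# trans-1,2-difluoroethene: the '/' bond markers split the SMILES
# into several path segments.
print(split_raw_smiles_url("F/C=C/F"))
```

In the first case the path ends at `CC` and `N` is parsed as the fragment; in the second the SMILES is spread over three path segments that the service would have to stitch back together.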

Based on Pat’s suggestion I converted the service to use base64 encoded SMILES, which let me simplify the code and remove the bugs. As a result, one can no longer append the SMILES directly to the URLs. Instead, the above URL would be rewritten in the form

All the example URLs described in my previous post that involve SMILES strings should be rewritten using base64 encoded SMILES. So to get a document listing all descriptors for “c1ccccc1COCC” one would write

and then follow the links therein.

While this makes it a little harder to write out these URLs by hand, I expect that most uses of this service will be programmatic, in which case getting base64 encoded SMILES is trivial.
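As an illustration of how little code this takes, here is a minimal Python sketch. The base URL is a placeholder and the helper names are hypothetical; the sketch also assumes the standard base64 alphabet, since the variant is not spelled out here (if the URL-safe alphabet is wanted, `base64.urlsafe_b64encode` is a drop-in replacement).

```python
import base64

# Placeholder base URL, NOT the real service address.
BASE_URL = "http://example.org/descriptors/"


def encode_smiles(smiles):
    """Return the base64 encoding of a SMILES string as text."""
    return base64.b64encode(smiles.encode("ascii")).decode("ascii")


def decode_smiles(token):
    """Inverse of encode_smiles, roughly what a service would do."""
    return base64.b64decode(token).decode("ascii")


def descriptor_url(smiles, base_url=BASE_URL):
    """Build a request URL with the encoded SMILES appended."""
    return base_url + encode_smiles(smiles)


token = encode_smiles("c1ccccc1COCC")
print(token)                       # YzFjY2NjYzFDT0ND
print(decode_smiles(token))        # c1ccccc1COCC
print(descriptor_url("CC#N"))      # no '#' left in the URL
```

One caveat with the standard alphabet is that it can itself emit ‘/’ and ‘+’ for some inputs, which is the main reason one might prefer the URL-safe variant.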

Written by Rajarshi Guha

January 11th, 2009 at 5:52 pm