Archive for the ‘descriptor’ tag
Sometime back Egon implemented a simple group contribution based volume calculator and it made its way into the stable branch (1.4.x) today. As a result I put out a new version of the CDKDescUI which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. The volume descriptor (based on the VABCVolume class) is one that has been missing for the some time (though the NumericalSurface class did return a volume, but it’s slow). This class is reasonably fast (10,000 molecules processed in 32 sec) and correlates well with the 2D and pseudo-3D volume descriptors from MOE (2008.10) as shown below. As expected the correlation is better with the 2D version of the descriptor (which is similar in nature to the lookup method used in the CDK version). The X-axis represents the CDK descriptor values.
A few days back, Aaron posted a question regarding the use of the tree width of a graph (intuitively, a measure of how tree like a graph is) in a chemical context. The paper that he pointed to was not very informative in terms of chemical applications. The discussion thread then expanded to asking about the utility of this descriptor – could it be used in a QSAR context as a descriptor of molecular structure? Or is it more suitable in a “filtering” scenario, since as Aaron pointed out “Some NP-complete problems become tractable when a graph has bounded treewidth … ” (with graph isomorphism given as an example).
I took a look at the first question – is it a useful descriptor? Yamaguchi et al, seems to indicate that this is a very degenerate descriptor (i.e., different structures give you the same value of the tree width). Luckily, someone had already done the hard work of implementing a variety of algorithms to evaluate tree widths. libtw is a Java library that provides a handy framework to experiment with tree width algorithms. I implemented a simple adapter to convert CDK molecule objects into the graph data structure used by libtw and a driver to process a SMILES file and report the tree width values as well as execution times. While libtw provides a number of tree width algorithms I just used a single one (arbitrarily). The code is available on Github and requires the CDK and libtw jar files to compile and run.
I took a random sample of 10,000 molecules from ChEMBL (also in the Github repository) and evaluated the upper bound of the tree width for each molecule. In addition, I evaluated a few well known topological descriptors for comparison purposes. The four plots summarize the results.
The calculation is certainly very fast, and, surprisingly, doesn’t seem to correlate with molecular size. Apparently, some relatively small molecules take the longest time – but even those are very fast. Unfortunately, the descriptor is indeed degenerate as shown in the top right – a given tree width value shows up for both small and large molecules (the R^2 between number of bonds and tree width is 0.03). The histogram in the lower left indicates that 60% of the molecules had the same value of tree width. In other words, the tree width does not really differentiate bewteen molecular structures (in terms of size or complexity). In contrast, if we consider the Weiner Path index, which has been used extensively in QSAR models, primarily as a measure of branching, we see that it exhibits a much closer relation with molecular size. Other topological measures focusing more specifically on structural complexity such as fragment complexity show similar correlations with molecular size (and with each other).
So in conclusion, I don’t think the tree width is a useful descriptor for modeling purposes.
Version 1.0.5 of the CDK descriptor calculator is now available. This version updates the command line batch mode to allow one to calculate a specific set of descriptors (as opposed to all or say, topological). The selected descriptors are specified using an XML file, which can be generated in the GUI mode – fire up the calculator in GUI mode, check the selected descriptors and then save the selection. You can then specify the selection file via the -s option.
My previous post did a quick comparison of the GSK anti-malarial screening dataset with a virtual library of Ugi products. That comparison was based on the PubChem fingerprints and indicated a broad degree of overlap. I was also interested in looking at the overlap in other feature spaces. The simplest way to do this is to evaluate a set of descriptors and then perform a principal components analysis. We can then plot the first two principal components to get an idea of the distribution of the compounds in the defined space.
I evaluated a number of descriptors using the CDK. In a physicochemical space represented by the number of rotatable bonds, molecular weight and XlogP values, a plot of the first two principal components looks as shown on the right. Given the large number of points, the plot is more of a blob, but does highlight the fact that there is a good degree of overlap between the two datasets. On going to a BCUT space on the left, we get a different picture, stressing the greater diversity of the GSK dataset. Of course, these are arbitary descriptor spaces and not necessarily meaningful. One would probably choose a descriptor space based on the problem at hand (and also the CDK XlogP implementation probably needs some work).
I was also interested in the promiscuity of the compounds in the GSK dataset. Promiscuity is the phenomenon where a molecule shows activity in multiple assays. Promiscuous activity could be indicate that the compound is truly active in all or most of the assays (i.e., hitting multiple distinct targets), but could also indicate that the activity is artifactual (such as if it were an aggregator or flourescent compound).
This analysis is performed by looking for those GSK molecules that are in the NCGC collection (272 exact matches) and checking to see how many NCGC assays they are tested in and whether they were active or not. Rather than look at all assays in the NCGC collection, I consider a subset of approximately 1300 assays curated by a colleague. Ideally, a compound will be active in only one (or a few) of the assays it is tested in.
For simplicities sake, I just plot the number of assays a compound is tested in versus the number of them that it is active in. The plot is colored by the activity (pXC50 value in the GSK SD file) so that more potent molecules are lighter. While the bulk of these molecules do not show significant promiscuous activity, a few of them do lie at the upper range. I’ve annotated four and their structures are shown below. Compound 530674 appears to be quite promiscuous given that it is active in 46 out of 84 assays it’s been tested in at the NCGC. On the other hand, 22942 is tested in 232 assays but is activity in 78 of them. This could be considered a low ratio, and isoquinolines have been noted to be non-promiscuous. (Both of these target kinases as noted in Gamo et al).
I’ve put out a new version (0.98) of the CDK descriptor calculator interface which uses the latest CDK master and also updates the save dialog for the descriptor selections to let the user specify a file name.