CDK & logP Values

Recently, Tony Williams enquired whether there had been any comparisons of the CDK with other tools for the calculation of polar surface area (PSA) and logP. Given that PSA calculations using the fragments defined by Ertl et al. are pretty straightforward, it’s not surprising that the CDK implementation matches very well with the ACD Labs implementation (based on 57,000 molecules). More interesting, however, is the performance of different logP methods on experimental data. (Note that Mannhold et al. performed a very comprehensive comparison of logP predictors; this post focuses just on the CDK.)

To that end, I evaluated logP values for ~10,000 molecules from the (proprietary) logPstar dataset, using the CDK’s XLogP implementation, ACD Labs (v12) and ChemAxon (c5.2.1_1). In all cases, default settings were used. As can be seen from the plots, ACD performs best and the XLogP method fares quite poorly. The CDK also has an implementation of ALogP, but it performed so poorly that I don’t list it here.

Given that the ACD predictions are based on a neural network model, I was interested in how well a predictive model based on CDK descriptors would perform when trained on this dataset. Since this was just a quick exploration, I didn’t put too much effort into the model building process. So I evaluated a set of CDK topological and constitutional descriptors and performed minimal feature selection to remove those descriptors with undefined values – giving a final pool of 111 descriptors.

I split the dataset into a training and prediction set (60/40 split) and then built a random forest model on the training set; random forests perform implicit feature selection and are fairly resistant to overfitting. As the plot shows, the performance is much better than XLogP (training set R² = 0.87 and prediction set R² = 0.86). Multiple training/prediction set splits gave similar results.
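
For anyone wanting to repeat the evaluation step, here is a minimal sketch of a 60/40 split and an R² calculation in plain Python. It is not the code used for the analysis above: the function names, the seed and the toy numbers are illustrative assumptions. The post doesn't state which definition of R² was used, so the sketch assumes the coefficient of determination, 1 - SS_res / SS_tot.

```python
import random


def split_indices(n, train_fraction=0.6, seed=0):
    """Shuffle 0..n-1 and split into (training, prediction) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy values, made up purely to exercise the functions.
    train, pred = split_indices(10)
    print(len(train), len(pred))
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(r_squared(observed, predicted))
```

Changing the seed and repeating the split is one simple way to check that the training and prediction set statistics stay stable across different splits.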

While it’s not as good as the ACD model, it was obtained using about 20 minutes of effort. Certainly, moving to a neural network or SVM model coupled with an explicit feature selection approach should lead to further improvements in the performance of this model.

2 thoughts on “CDK & logP Values”

  1. Eric Minikel says:

    Thanks for this. I was wondering whether to use alogp or xlogp from rcdk – from your post it sounds like xlogp is by far the better choice though still not great. I had originally ruled out xlogp because it throws a Java null pointer exception on aminophylline – here is a minimal script to reproduce the error:

    http://www.cureffi.org/wp-content/uploads/2013/10/xlogp-nullpointerexception.r.txt

    alogp on the other hand never throws errors, but it often returns NA values and I don’t know how to interpret these (for instance, I need to convert them to some numerical value for PCA). Any advice?

  2. Hi Eric, the problem you’re facing is that aminophylline is not a single compound – it’s a 2:1 mixture of theophylline and ethylenediamine. The xlogp descriptor (like most others) only considers a single structure. I’m guessing theophylline would be the active ingredient, so the following should give a valid value:

    library(rcdk)
    # parse.smiles returns a list of molecules, which eval.desc accepts directly
    theophylline = parse.smiles("CN1C2=C(NC=N2)C(=O)N(C)C1=O")
    eval.desc(theophylline, "org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor")

    Of course, this doesn’t help you get the logP for aminophylline itself.

    On a related note, you should probably switch to the latest versions of fingerprint, rcdklibs and rcdk from the github repo. They have been updated to the latest CDK master, so they are faster and have fewer bugs.
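
On the mixture issue raised in this thread, here is a rough sketch in plain Python of splitting a multi-component SMILES before evaluating a single-structure descriptor. In SMILES, '.' separates disconnected components. The aminophylline string is assembled here as an assumption (two copies of the theophylline SMILES from the reply plus NCCN for ethylenediamine), and choosing the component with the most letters is a crude, hypothetical heuristic, not something rcdk or the CDK does for you.

```python
def split_components(smiles):
    """Split a multi-component SMILES on '.', the component separator."""
    return [part for part in smiles.split(".") if part]


def largest_component(smiles):
    """Return the component with the most alphabetic characters.

    Counting letters is only a crude stand-in for atom count (two-letter
    elements such as Cl count twice); it is a hypothetical heuristic.
    """
    components = split_components(smiles)
    if not components:
        raise ValueError("empty SMILES")
    return max(components, key=lambda part: sum(ch.isalpha() for ch in part))


THEOPHYLLINE = "CN1C2=C(NC=N2)C(=O)N(C)C1=O"
# Assumed representation of the 2:1 mixture: two theophyllines, one ethylenediamine.
AMINOPHYLLINE = ".".join([THEOPHYLLINE, THEOPHYLLINE, "NCCN"])

if __name__ == "__main__":
    print(split_components(AMINOPHYLLINE))
    print(largest_component(AMINOPHYLLINE))
```

The selected component could then be passed to parse.smiles and eval.desc as in the reply. Whether the largest component is the one of interest still has to be decided case by case.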
