Recently, Tony Williams enquired whether there had been any comparisons of the CDK with other tools for the calculation of polar surface area (PSA) and logP. Given that PSA calculations using the fragments defined by Ertl et al are pretty straightforward, it’s not surprising that the CDK implementation matches very well with the ACD Labs implementation (based on 57,000 molecules). More interesting however is the performance of different logP methods on experimental data. (Note that Mannhold et al performed a very comprehensive comparison of logP predictors. This post just focuses on the CDK).
To that end I evaluated logP values for ~ 10,000 molecules from the (proprietary) logPstar dataset, using the CDK’s XLogP implementation, ACD Labs (v12) and ChemAxon (c5.2.1_1). As can be seen from the plots, ACD performs best and the XLogP method fairs quite poorly. In all cases, default settings were used. In addition the CDK has an implementation of ALogP, but it performed so poorly that I don’t list it here.
Given that the ACD predictions are based on a neural network model, I was interested in how well a predictive model based on CDK descriptors would perform when trained on this dataset. Since this was just a quick exploration, I didn’t put too much effort into the model building process. So I evaluated a set of CDK topological and constitutional descriptors and performed minimal feature selection to remove those descriptors with undefined values – giving a final pool of 111 descriptors.
I split the dataset into a training and prediction set (60/40 split) and then threw them into a random forest model, which performs implicit feature selection and doesn’t overfit. As the plot shows, the performance is significantly better than XLogP (training set R2 = 0.87 and prediction set R2 = 0.86). Multiple training/prediction set splits gave similar results.
While it’s not as good as the ACD model, it was obtained using about 20 minutes of effort. Certainly, moving to a neural network or SVM model coupled with an explicit feature selection approach should lead to further improvements in the performance of this model.