A major component of QSAR modeling is the choice of molecular descriptors used in a model. The literature is replete with descriptors, and there is plenty of software (commercial and open source) to calculate them. There are many issues related to molecular descriptors (such as many descriptors being correlated), but I came across a paper by Frank Burden and co-workers describing a “universal descriptor”. What is such a descriptor?
The idea derives from the fact that molecular descriptors usually characterize one specific structural feature. But in many cases, the biological activity of a molecule is a function of multiple structural features. This implies that you need multiple descriptors to capture the entire structure-activity relationship. The goal of a universal descriptor set is to characterize a molecular structure in such a way that it (implicitly or explicitly) encodes all the structural features that might be relevant for the molecule’s activity in multiple, diverse scenarios. In other words, a true universal descriptor set could be used in a variety of QSAR models without requiring additional descriptors.
One might ask whether this is feasible. But when we realize that in many cases biological activity is controlled by shape and electrostatics, it makes sense that a descriptor characterizing these two features simultaneously would be a good candidate. Burden et al. describe “charge fingerprints”, which are claimed to be a step towards such a universal descriptor set.
These descriptors are essentially binned counts of partial charges on specific atoms. The method considers 7 elements (H, C, N, O, P, S, Si) and for each element declares 3 bins. Then for a given molecule, one simply bins the partial charges on the atoms. This results in an 18-element descriptor vector which can then be used in QSAR modeling. This is a very simple descriptor to implement (the authors’ implementation is commercially available, as far as I can see). They test it out on several large and diverse datasets and also compare these descriptors to atom count descriptors and BCUTs.
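To make the binning step concrete, here is a minimal Python sketch. It is not the authors’ implementation: the function name `charge_fingerprint`, the bin edges and the input format (a list of element symbol and partial charge pairs, computed elsewhere) are all assumptions made for illustration. The paper’s actual bin boundaries are not reproduced here.

```python
from bisect import bisect_right

# Hypothetical defaults: the seven elements listed above, with placeholder
# bin edges (NOT the values used in the paper).
ELEMENTS = ("H", "C", "N", "O", "P", "S", "Si")
BIN_EDGES = (-0.1, 0.1)  # two edges give three bins: q < -0.1, -0.1 <= q < 0.1, q >= 0.1


def charge_fingerprint(atoms, elements=ELEMENTS, edges=BIN_EDGES):
    """Count atoms per (element, charge bin).

    `atoms` is an iterable of (element_symbol, partial_charge) pairs; the
    partial charges are assumed to come from whatever charge model you use.
    """
    n_bins = len(edges) + 1
    # Each element owns a contiguous block of n_bins slots in the vector.
    offsets = {symbol: i * n_bins for i, symbol in enumerate(elements)}
    fingerprint = [0] * (len(elements) * n_bins)
    for symbol, charge in atoms:
        if symbol not in offsets:
            continue  # elements outside the chosen set are ignored
        fingerprint[offsets[symbol] + bisect_right(edges, charge)] += 1
    return fingerprint


# Made-up charges for a methanol-like molecule, purely for illustration.
example = [("C", 0.03), ("O", -0.40), ("H", 0.21),
           ("H", 0.05), ("H", 0.05), ("H", 0.05)]
print(charge_fingerprint(example))
```

The length of the vector is simply the number of elements times the number of bins, so with the placeholder configuration above the sketch returns 21 counts. Changing `elements` or `edges` changes the layout, which is where the empirical choices come in.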
The results indicate that while the charge fingerprints perform similarly to things like BCUTs, in the end combinations of these charge fingerprints with other descriptors perform best. OK, so that seems to preclude the charge fingerprints being universal in nature. The fact that the number of bins is an empirical choice based on the datasets they employed also seems like a factor that prevents them from being universal descriptors. And shape isn’t considered. Given this point, it would have been interesting to see how these descriptors compared to CPSAs. So while simple, interpretable and useful, it’s not clear why these would be considered universal.
Did you implement them for the CDK yet? Was thinking to do that, and might find some time this weekend…
BTW, one thing I do not like about this is that it only applies to molecules with the 7 elements… any other atom will surely have an effect on any property too, but is just disregarded. So, I’m sure a CDK implementation would support the 7 as default, but the full PT as an extension.
No, I haven’t done it, but if you want to, I’ll be happy to defer to you.
I agree that going for the full PT would be nice, but it would correspondingly increase the descriptor size (cf. their comments regarding fluorine).
But a bigger issue is that G-M charges for pi systems do not work very well, last time I looked. Have there been updates to the partial charge classes?