Archive for the ‘qsar’ tag
I’ve put out a new version (0.98) of the CDK descriptor calculator interface which uses the latest CDK master and also updates the save dialog for the descriptor selections to let the user specify a file name.
Some time back, John Van Drie and I did some work on characterizing structure-activity cliffs – pairs of molecules that have very similar structures but very different activities. The term originated with Maggiora, who suggested that such cliffs are a reason for the failure of many QSAR models. At the same time, cliffs can represent the most interesting portions of a structure-activity relationship. Our first paper introduced the Structure Activity Landscape Index (SALI) – a numerical metric to characterize activity cliffs and SAR landscapes – and was followed by a paper describing how one could use SALI to characterize the quality of predictive models. We also put out some software to let others use SALI to analyse their own data.
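For a pair of molecules with a precomputed structural similarity, the SALI value is easy to evaluate; a minimal sketch in Python (the activities and similarity values below are made-up illustrative numbers, not real data):

```python
# SALI for a pair of molecules i and j:
#   SALI(i, j) = |A_i - A_j| / (1 - sim(i, j))
# where A is the activity and sim is a structural similarity in [0, 1].

def sali(act_i, act_j, similarity):
    """Structure Activity Landscape Index for one molecule pair."""
    if similarity >= 1.0:
        # identical structures with different activities: an infinite cliff
        return float("inf")
    return abs(act_i - act_j) / (1.0 - similarity)

# Toy example: a very similar pair with very different activities (a cliff)
# scores much higher than a dissimilar pair with the same activity gap.
print(sali(7.2, 4.1, 0.95))  # cliff-like pair, large SALI
print(sali(7.2, 4.1, 0.40))  # dissimilar pair, small SALI
```

In practice the pairwise SALI values are computed over a whole dataset and the resulting matrix (or network) is what characterizes the landscape.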
A few weeks back I heard from Jinhua Zhang at Simulations Plus Inc. that they had implemented the SALI methodology in their ADMET Predictor product. He also provided some slides showing how they used SALI to characterize a variety of predictive models for hERG inhibition.
It’s a nice feeling to see our theoretical work being incorporated into a real world product!
A major component of QSAR modeling is the choice of molecular descriptors used in a model. The literature is replete with descriptors and there’s lots of software (commercial and open source) to calculate them. There are many issues related to molecular descriptors (such as many of them being correlated with each other), but I came across a paper by Frank Burden and co-workers describing a “universal descriptor”. What is such a descriptor?
The idea derives from the fact that a molecular descriptor usually characterizes one specific structural feature. But in many cases, the biological activity of a molecule is a function of multiple structural features. This implies that you need multiple descriptors to capture the entire structure-activity relationship. The goal of a universal descriptor set is to characterize a molecular structure in such a way that it (implicitly or explicitly) encodes all the structural features that might be relevant for the molecule’s activity in multiple, diverse scenarios. In other words, a true universal descriptor set could be used in a variety of QSAR models and not require additional descriptors.
One might ask whether this is feasible. But when we realize that in many cases biological activity is controlled by shape and electrostatics, it makes sense that a descriptor characterizing these two features simultaneously would be a good candidate. Burden et al. describe “charge fingerprints”, which are claimed to be a step towards such a universal descriptor set.
These descriptors are essentially binned counts of partial charges on specific atoms. The method considers 7 atom types (H, C, N, O, P, S, Si) and defines 3 bins for each. Then for a given molecule, one simply bins the partial charges on the atoms. This results in an 18-element descriptor vector which can then be used in QSAR modeling. This is a very simple descriptor to implement (the authors’ implementation is commercially available, as far as I can see). They test it out on several large and diverse datasets and also compare these descriptors to atom count descriptors and BCUTs.
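As a rough sketch of what such a binning might look like (the bin edges, the example partial charges, and the exact vector layout are my own assumptions for illustration, not the authors’ published parameters – note that 7 element types with 3 bins each gives 21 counts in this toy version):

```python
# Toy "charge fingerprint": count, per element type, how many atoms fall
# into each of three partial-charge bins. Bin edges are made up here.

ELEMENTS = ["H", "C", "N", "O", "P", "S", "Si"]
BIN_EDGES = (-0.2, 0.2)  # bins: q < -0.2, -0.2 <= q <= 0.2, q > 0.2

def charge_fingerprint(atoms):
    """atoms: list of (element_symbol, partial_charge) tuples."""
    counts = {e: [0, 0, 0] for e in ELEMENTS}
    for elem, q in atoms:
        if elem not in counts:
            continue  # ignore element types outside the fixed list
        if q < BIN_EDGES[0]:
            counts[elem][0] += 1
        elif q <= BIN_EDGES[1]:
            counts[elem][1] += 1
        else:
            counts[elem][2] += 1
    # flatten to a fixed-length vector for use in a QSAR model
    return [c for e in ELEMENTS for c in counts[e]]

# a caricature of methanol with made-up partial charges
mol = [("C", 0.03), ("O", -0.40), ("H", 0.05),
       ("H", 0.05), ("H", 0.05), ("H", 0.21)]
print(charge_fingerprint(mol))
```

The appeal is obvious: given any partial charge scheme, the descriptor is a few lines of counting code, and the resulting vector is directly interpretable.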
The results indicate that while similar in performance to things like BCUTs, in the end combinations of these charge fingerprints with other descriptors perform best. OK, so that seems to preclude the charge fingerprints being universal in nature. The fact that the number of bins is an empirical choice based on the datasets they employed also seems like a factor that prevents them from being universal descriptors. And shape isn’t considered at all. Given this last point, it would have been interesting to see how these descriptors compared to CPSAs. So while simple, interpretable and useful, it’s not clear why these would be considered universal.
Using the model deployment and prediction service, I put up the two linear regression models I had built so far (described in more detail here). While REST is nice, a simple web page that allows you to paste a set of SMILES and get back predictions is handy. So I whipped together a simple interface to the prediction service, allowing one to select a model, view the author-generated description and get a nice (sortable!) table of predicted values. View it here. As noted in my previous post it’s not going to be very fast, but hopefully that will change in the future.
Over the past few days I’ve been developing some predictive models in R, for the solubility data being generated as part of the ONS Solubility Challenge. As I develop the models I put up a brief summary of the results on the wiki. In the end however, we’d like to use these models to predict the solubility of untested compounds. While anybody can send me a SMILES string and get back a prediction, it’s more useful (and less work for me!) if a user can do it themselves. This requires that the models be deployed and made available as a web page or a service. Last year I developed a series of statistical web services based on R. The services were written in Java and are described in this paper. Since I’m working more with REST services these days, I wanted to see how easy it’d be to develop a model deployment system using Python, thus avoiding a multi-tiered system. With the help of rpy2, it turns out that this wasn’t very difficult.
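As a rough sketch of the deployment pattern (the coefficients, descriptor names and toy descriptor calculation below are stand-ins I made up for illustration – in the real service the prediction step goes through rpy2 to R’s predict() on the fitted model, and the descriptors come from the CDK):

```python
# A tiny prediction service: POST a JSON list of SMILES, get back a JSON
# map of SMILES -> predicted value. Plain-Python stand-ins replace the
# rpy2/R model call and the real descriptor calculation.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# stand-in for a fitted R linear model: intercept plus two coefficients
COEFS = {"intercept": -1.2, "XLogP": 0.8, "TPSA": -0.01}

def compute_descriptors(smiles):
    # stand-in descriptor calculation; a real service would call the CDK
    return {"XLogP": 0.1 * len(smiles), "TPSA": 2.0 * smiles.count("O")}

def predict(smiles):
    # stand-in for r.predict(model, newdata=...) via rpy2
    d = compute_descriptors(smiles)
    return COEFS["intercept"] + sum(COEFS[k] * v for k, v in d.items())

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        smiles_list = json.loads(body)
        result = {s: predict(s) for s in smiles_list}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

# to actually serve predictions:
# HTTPServer(("localhost", 8080), PredictHandler).serve_forever()
```

The nice thing about this arrangement is that swapping the stand-in for the rpy2 call is localized to predict(), so the HTTP layer doesn’t care what actually produces the numbers.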