Archive for the ‘cheminformatics’ Category
There are a number of scenarios when it’s useful to be able to classify protein targets – high level summaries, enrichment calculations and so on. There are a variety of protein classification schemes out there such as PANTHER, SCOP and InterPro. These schemes are based on domains and other structural features. ChEMBL provides it’s own hierarchical classification. Since I use this from time to time, it’s useful to pull all the classifications for a given species, at one go via the SQL below (tested with v17):
td.pref_name, description, accession, pfc . *
td.tax_id = 9606 AND td.tid = tc.tid
AND tc.component_id = cs.component_id
AND cc.component_id = cs.component_id
AND pfc.protein_class_id = cc.protein_class_id;
I came across a recent paper from the Tropsha group that discusses the issue of modelability – that is, can a dataset (represented as a set of computed descriptors and an experimental endpoint) be reliably modeled. Obviously the definition of reliable is key here and the authors focus on a cross-validated classification accuracy as the measure of reliability. Furthermore they focus on binary classification. This leads to a simple definition of modelability – for each data point, identify whether it’s nearest neighbor is in the same class as the data point. Then, the ratio of number of observations whose nearest neighbor is in the same activity class to the number observations in that activity class, summed over all classes gives the MODI score. Essentially this is a statement on linear separability within a given representation.
The authors then go show a pretty good correlation between the MODI scores over a number of datasets and their classification accuracy. But this leads to the question – if one has a dataset and associated modeling tools, why compute the MODI? The authors state
we suggest that MODI is a simple characteristic that can be easily computed for any dataset at the onset of any QSAR investigation
I’m not being rigorous here, but I suspect for smaller datasets the time requirements for MODI calculations is pretty similar to building the models themselves and for very large datasets MODI calculations may take longer (due to the requirement of a distance matrix calculation – though this could be alleviated using ANN or LSH). In other words – just build the model!
Another issue is the relation between MODI and SVM classification accuracy. The key feature of SVMs is that they apply the kernel trick to transform the input dataset into a higher dimensional space that (hopefully) allows for better separability. As a result MODI calculated on the input dataset should not necessarily be related to the transformed dataset that is actually operated on by the SVM. In other words a dataset with poor MODI could be well modeled by an SVM using an appropriate kernel.
The paper, by definition, doesn’t say anything about what model would be best for a given dataset. Furthermore, it’s important to realize that every dataset can be perfectly predicted using a sufficiently complex model. This is also known as an overfit model. The MODI approach to modelability avoids this by considering a cross-validated accuracy measure.
One application of MODI that does come to mind is for feature selection – identify a descriptor subset that leads to a predictive model. This is justified by the observed correlation between the MODI scores and the observed classification rates and would avoid having to test feature subsets with the modeling algorithm itself. An alternative application (as pointed out by the authors) is to identify subsets of the data that exhibit a good MODI score, thus leading to a local QSAR model.
More generally, it would be interesting to extend the concept to regression models. Intuitively, a dataset that is continuous in a given representation should have a better modelability than one that is discontinuous. This is exactly the scenario that can be captured using the activity landscape approach. Sometime back I looked at characterizing the roughness of an activity landscape using SALI and applied it to the feature selection problem – being able to correlate such a measure to predictive accuracy of models built on those datasets could allow one to address modelability (and more specifically, what level of continuity should a landscape present to be modelable) in general.
Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.
His patch led to a 10-fold improvement in running time. However his code involved the use of nested for loops in R. This is a well known bottleneck and most idiomatic R code replaces for loops with a member of the sapply/lapply/tapply family. In this case however, it was easier to write a small piece of C code to perform the loops, resulting in a 4- to 6-fold improvement over Abhiks observed running times (see figure summarizing Tanimoto similarity matrix calculation for 1024 bit fingerprints, with 256 bits randomly selected to be 1). As always, the latest code is available on Github.
I’ve pushed updates to the rcdklibs and rcdk packages that support cheminformatics in R using the CDK. The new versions employ the latest CDK master, which as Egon pointed out has significantly fewer bugs, and thanks to Jon, improved performance. New additions to the package include support for the LINGO and Signature fingerprinters (you’ll need the latest version of fingerprint).
I’ve just updated the fingerprint package to v3.5.0 (should show up on CRAN shortly, or else you can get it directly from my Github repository). The main update in this version is better support for feature,count type fingerprints. An example would be ECFP or signature fingerprints. In these types of fingerprints, the output is usually a set of (integer or long) hash values or else structural fragments along with their count of occurrences.
The updated package now provides an S4 class to represent features and their counts. An example of this class is
f1 <- new("feature",
The package provides getters and setters for these objects, allow you to get or set the feature and the count.
> feature(f1) <- 'ABCD'
> count(f1) <- 12
Using this class, feature,count fingerprints are now represented as objects of class featvec. For these fingerprints, instead of bits, one obtains a list of feature objects. For fingerprints read from files that provide the hashed version of the underlying structure (or neighborhood etc), the numeric hashes are read in as features, with a default count of 1. The distance method has also been updated to evaluate similarities for feature,count fingerprints, though currently it does not use the count in the similarity calculation.
As an example, consider a set of ECFP’s available from here
> fps <- fp.read('http://pastebin.com/raw.php?i=gHjTQNKP', lf=ecfp.lf, binary=FALSE)
name = mol01
source = ecfp.lf
features = 17:1 0:1 16:1 3:1 1:1 1747237384:1 1499521844:1 -1539132615:1 1294255210:1 332760439:1 -1549163031:1 1035613116:1 1618154665:1 590925877:1 1872154524:1 -1143715940:1 203677720:1 -1272768868:1 136120670:1 136597326:1 -1460348762:1 -1262922302:1 -1201618245:1 -402549409:1 -1270820019:1 929601590:1 -1597477966:1 -1274743746:1 -1155471474:1 1258428229:1 -1838187238:1 -798628285:1 -1773728142:1 -773983804:1 -453677277:1 1674451008:1 65948508:1 991735244:1 -1412946825:1 846704869:1 -2103621484:1 -886204842:1 1725648567:1 -353343892:1 -585443181:1 -533273616:1 2031084733:1 -801248129:1 1752802620:1 -976015189:1 -992213424:1 2109043264:1 -790336137:1 630139722:1 -505031736:1 -1427697183:1 -2090462286:1 -1724769936:1
> distance(fps[], fps[])
> distance(fps[], fps[])