Archive for the ‘software’ Category
With the profusion of chemical information on the web – in the form of chemical names, images of structures, specific codes (InChI etc), it’s sometimes very useful to be able to seamlessly retrieve some extra information while browsing a page that contains such entities. The usual way is to copy the InChI/SMILES/CAS/name string and paste into Pubchem, Chemspider and so on.
However, a much smoother way is now available via a Firefox extension, called NCATSFind, developed by my colleague. It’s a one click install and once installed, automatically identifies a variety of chemical id codes (CAS number, InChI, UNII) and when such entities are identified uses a variety of backend services to provide context. In addition, it has a cool feature that lets you select an image and generate a structure (using OSRA in the background).
Check out his blog post for more details.
Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.
His patch led to a 10-fold improvement in running time. However his code involved the use of nested for loops in R. This is a well known bottleneck and most idiomatic R code replaces for loops with a member of the sapply/lapply/tapply family. In this case however, it was easier to write a small piece of C code to perform the loops, resulting in a 4- to 6-fold improvement over Abhiks observed running times (see figure summarizing Tanimoto similarity matrix calculation for 1024 bit fingerprints, with 256 bits randomly selected to be 1). As always, the latest code is available on Github.
I’ve pushed updates to the rcdklibs and rcdk packages that support cheminformatics in R using the CDK. The new versions employ the latest CDK master, which as Egon pointed out has significantly fewer bugs, and thanks to Jon, improved performance. New additions to the package include support for the LINGO and Signature fingerprinters (you’ll need the latest version of fingerprint).
I’ve just updated the fingerprint package to v3.5.0 (should show up on CRAN shortly, or else you can get it directly from my Github repository). The main update in this version is better support for feature,count type fingerprints. An example would be ECFP or signature fingerprints. In these types of fingerprints, the output is usually a set of (integer or long) hash values or else structural fragments along with their count of occurrences.
The updated package now provides an S4 class to represent features and their counts. An example of this class is
f1 <- new("feature",
The package provides getters and setters for these objects, allow you to get or set the feature and the count.
> feature(f1) <- 'ABCD'
> count(f1) <- 12
Using this class, feature,count fingerprints are now represented as objects of class featvec. For these fingerprints, instead of bits, one obtains a list of feature objects. For fingerprints read from files that provide the hashed version of the underlying structure (or neighborhood etc), the numeric hashes are read in as features, with a default count of 1. The distance method has also been updated to evaluate similarities for feature,count fingerprints, though currently it does not use the count in the similarity calculation.
As an example, consider a set of ECFP’s available from here
> fps <- fp.read('http://pastebin.com/raw.php?i=gHjTQNKP', lf=ecfp.lf, binary=FALSE)
name = mol01
source = ecfp.lf
features = 17:1 0:1 16:1 3:1 1:1 1747237384:1 1499521844:1 -1539132615:1 1294255210:1 332760439:1 -1549163031:1 1035613116:1 1618154665:1 590925877:1 1872154524:1 -1143715940:1 203677720:1 -1272768868:1 136120670:1 136597326:1 -1460348762:1 -1262922302:1 -1201618245:1 -402549409:1 -1270820019:1 929601590:1 -1597477966:1 -1274743746:1 -1155471474:1 1258428229:1 -1838187238:1 -798628285:1 -1773728142:1 -773983804:1 -453677277:1 1674451008:1 65948508:1 991735244:1 -1412946825:1 846704869:1 -2103621484:1 -886204842:1 1725648567:1 -353343892:1 -585443181:1 -533273616:1 2031084733:1 -801248129:1 1752802620:1 -976015189:1 -992213424:1 2109043264:1 -790336137:1 630139722:1 -505031736:1 -1427697183:1 -2090462286:1 -1724769936:1
> distance(fps[], fps[])
> distance(fps[], fps[])
Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need for this for many years. A few years back I described a system to deploy R models and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a., descriptors). Depending on the licenses associated with descriptor software such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform which looks quite nice, but only supports R models (not necessarily a bad thing!). A very recent candidate for generic predictive model (amongst other things) deployment is via plugins for the BARD platform.
But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specifications and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.