Archive for the ‘fingerprint’ tag
Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.
His patch led to a 10-fold improvement in running time. However, his code involved nested for loops in R. This is a well-known bottleneck, and most idiomatic R code replaces for loops with a member of the sapply/lapply/tapply family. In this case, however, it was easier to write a small piece of C code to perform the loops, resulting in a 4- to 6-fold improvement over Abhik’s observed running times (see the figure summarizing Tanimoto similarity matrix calculation for 1024-bit fingerprints, with 256 bits randomly selected to be 1). As always, the latest code is available on GitHub.
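To make the loop bottleneck concrete, here is a minimal base-R sketch of a Tanimoto similarity matrix computed two ways: with nested for loops and with matrix algebra. It is neither Abhik's patch nor the C code in the package. The function names `tanimoto_loop` and `tanimoto_matrix` and the toy data are assumptions made for illustration only.

```r
# Toy data (assumption): 5 random fingerprints, 1024 bits each, 256 bits set,
# stored as one 0/1 row per fingerprint
set.seed(1)
nbit <- 1024
fps <- t(replicate(5, { v <- integer(nbit); v[sample(nbit, 256)] <- 1L; v }))

# Nested-loop version: Tanimoto = |A and B| / |A or B|
tanimoto_loop <- function(m) {
  n <- nrow(m)
  s <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      common <- sum(m[i, ] & m[j, ])
      either <- sum(m[i, ] | m[j, ])
      s[i, j] <- common / either
    }
  }
  s
}

# Loop-free version: one matrix product gives all pairwise common-bit counts
tanimoto_matrix <- function(m) {
  common <- tcrossprod(m)              # common[i, j] = bits set in both i and j
  counts <- rowSums(m)                 # bits set in each fingerprint
  common / (outer(counts, counts, "+") - common)
}

# Both versions should agree on the toy data
stopifnot(isTRUE(all.equal(tanimoto_loop(fps), tanimoto_matrix(fps))))
```

The second form pushes the looping into compiled code via `tcrossprod`, which is the same general idea as moving the loops to C, though the package itself may do things differently.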
I’ve just updated the fingerprint package to v3.5.0 (should show up on CRAN shortly, or else you can get it directly from my Github repository). The main update in this version is better support for feature,count type fingerprints. An example would be ECFP or signature fingerprints. In these types of fingerprints, the output is usually a set of (integer or long) hash values or else structural fragments along with their count of occurrences.
The updated package now provides an S4 class to represent features and their counts. An example of this class is
f1 <- new("feature",
The package provides getters and setters for these objects, allowing you to get or set the feature and the count.
> feature(f1) <- 'ABCD'
> count(f1) <- 12
Using this class, feature,count fingerprints are now represented as objects of class featvec. For these fingerprints, instead of bits, one obtains a list of feature objects. For fingerprints read from files that provide the hashed version of the underlying structure (or neighborhood etc), the numeric hashes are read in as features, with a default count of 1. The distance method has also been updated to evaluate similarities for feature,count fingerprints, though currently it does not use the count in the similarity calculation.
As an example, consider a set of ECFPs available from here:
> fps <- fp.read('http://pastebin.com/raw.php?i=gHjTQNKP', lf=ecfp.lf, binary=FALSE)
name = mol01
source = ecfp.lf
features = 17:1 0:1 16:1 3:1 1:1 1747237384:1 1499521844:1 -1539132615:1 1294255210:1 332760439:1 -1549163031:1 1035613116:1 1618154665:1 590925877:1 1872154524:1 -1143715940:1 203677720:1 -1272768868:1 136120670:1 136597326:1 -1460348762:1 -1262922302:1 -1201618245:1 -402549409:1 -1270820019:1 929601590:1 -1597477966:1 -1274743746:1 -1155471474:1 1258428229:1 -1838187238:1 -798628285:1 -1773728142:1 -773983804:1 -453677277:1 1674451008:1 65948508:1 991735244:1 -1412946825:1 846704869:1 -2103621484:1 -886204842:1 1725648567:1 -353343892:1 -585443181:1 -533273616:1 2031084733:1 -801248129:1 1752802620:1 -976015189:1 -992213424:1 2109043264:1 -790336137:1 630139722:1 -505031736:1 -1427697183:1 -2090462286:1 -1724769936:1
> distance(fps[], fps[])
> distance(fps[], fps[])
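What a count-free similarity on feature,count fingerprints might look like can be sketched in base R. This assumes a Tanimoto-style comparison of the feature sets, and the package's own method may differ in detail. The named-vector representation, the example features and the function `feature_tanimoto` are hypothetical and are not part of the fingerprint package.

```r
# Two hypothetical feature,count fingerprints: names are features, values are counts
fp_a <- c("17" = 1L, "0" = 1L, "1747237384" = 2L, "-1539132615" = 1L)
fp_b <- c("17" = 3L, "0" = 1L, "332760439" = 1L)

# Tanimoto similarity on the feature sets only; the counts are deliberately ignored
feature_tanimoto <- function(a, b) {
  fa <- names(a)
  fb <- names(b)
  length(intersect(fa, fb)) / length(union(fa, fb))
}

# 2 shared features out of 5 distinct features overall
feature_tanimoto(fp_a, fp_b)
```

Because only the names are compared, changing a count (say, from 1 to 3) leaves the similarity unchanged.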
I’ve just pushed a new version of the fingerprint package that contains an update provided by Abhik Seal that significantly speeds up calculation of pairwise similarity matrices when using the Dice similarity method. I ran a simple comparison using different numbers of random fingerprints (1024 bits, with 512 bits randomly set to one) and measured the time to evaluate the pairwise similarity matrix. As you can see from the figure alongside, the new code is significantly faster (with speedups of 450x to 500x). The code to generate the timings is below; it should probably be wrapped in a loop to run multiple times for each set size.
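library(fingerprint)
# For each set size n = 10, 20, ..., 300, build a list of n random fingerprints
fpls <- lapply(seq(10, 300, by=10),
function(n) lapply(seq_len(n),
function(i) random.fingerprint(1024, 512)))
# Elapsed time (in seconds) to compute each pairwise Dice similarity matrix
times <- sapply(fpls,
function(fpl) system.time(fp.sim.matrix(fpl, method='dice'))['elapsed'])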
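One way to repeat the timings for each set size is sketched below in base R. To stay self-contained it does not call the fingerprint package: `random_bits`, `dice_matrix` and `time_median` are hypothetical stand-ins, and the choice of five repetitions summarised by the median is an assumption, not what was used for the figure.

```r
set.seed(42)

# Stand-in for a list of random fingerprints: an n x nbit 0/1 matrix
# with `on` bits set in every row
random_bits <- function(n, nbit = 1024, on = 512) {
  t(replicate(n, { v <- integer(nbit); v[sample(nbit, on)] <- 1L; v }))
}

# Dice similarity matrix: 2 * |A and B| / (|A| + |B|)
dice_matrix <- function(m) {
  counts <- rowSums(m)
  2 * tcrossprod(m) / outer(counts, counts, "+")
}

# Time one set size `reps` times and keep the median elapsed time in seconds
time_median <- function(n, reps = 5) {
  m <- random_bits(n)
  median(replicate(reps, system.time(dice_matrix(m))[["elapsed"]]))
}

sizes <- seq(10, 50, by = 10)
times <- sapply(sizes, time_median)
```

Taking the median rather than a single run damps the effect of garbage collection and other one-off pauses on the measured times.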
The other day I was exchanging emails with John Van Drie regarding open challenges in cheminformatics (which I’ll say more about later). One of his comments concerned the slow speed of chemical searches:
Google searches are screamingly fast, so fast that the type-ahead feature is doing the search as you key characters in. Why are all chemical searches so sloooow? … Ideally, as you sketch your mol in, the searches should be happening at the same pace, like the typeahead feature.
Now, he doesn’t specifically mention what type of chemical search: it could be exact matches, similarity searches, substructure or pharmacophore searches. The first two can be done very quickly and lend themselves easily to type-ahead search interfaces. In light of the work my colleague has been doing, substructure searches are now also amenable to a type-ahead interface.
So I quickly put together a simple web page that lets you type in a SMILES (or SMARTS) and as you type it retrieves the results of a substructure search via the NCTT Search Server REST API. (In some cases the depiction is broken – that’s a bug on my side). Of course, typing in SMILES is not the most intuitive of interfaces. Since Trung employs the ChemDoodle sketcher, an ideal interface would respond to drawing events (say drawing a bond or adding atoms etc) and pull up matches on the fly. Another obvious extension is to rank (or filter) the results – all the while, maintaining the near real time speed of the application.
As I said before, seriously fast substructure searches. It also helps that I can build these examples via a public REST API. I’m sure there are reasons for SOAP, XML and so on. But it’s 2011. So let’s help make extensions and mashups easier.
UPDATE: Yes, it’s easy to create patterns (especially with SMARTS) that DoS the server. We have some filters for excessively generic patterns, so some queries may not behave in the expected manner.
My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).
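For readers unfamiliar with fingerprint pre-screening, the general idea can be sketched in a few lines of base R. This is the textbook form of the technique and not a description of Trung's implementation or of his indexing scheme. The 8-bit fingerprints, the molecule names and the function `prescreen` are hypothetical.

```r
# Hypothetical 8-bit structural-key fingerprints, one row per database structure
db <- rbind(
  mol1 = c(1, 1, 0, 1, 0, 0, 1, 0),
  mol2 = c(1, 0, 0, 1, 0, 0, 0, 0),
  mol3 = c(1, 1, 1, 1, 0, 1, 1, 0)
)
query <- c(1, 1, 0, 0, 0, 0, 1, 0)

# A structure survives the screen only if every bit set in the query
# is also set in its own fingerprint
prescreen <- function(db, query) {
  on <- which(query == 1)
  rownames(db)[rowSums(db[, on, drop = FALSE]) == length(on)]
}

# Only the survivors would need the expensive atom-by-atom substructure match
candidates <- prescreen(db, query)
```

The screen can only rule structures out: a survivor still has to pass a full substructure match, but cheaply discarding most of the database is what makes that affordable.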
It is blazingly fast.
I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000 structure subset is available here. For each query structure I record the number of hits, time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.
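A benchmark loop of this kind might be structured as in the sketch below. The REST endpoint is not reproduced here, so `run_query_stub` is a hypothetical placeholder for the real call and `benchmark_queries` is an illustrative helper. Note also that it measures client-side elapsed time, whereas the times reported in this post are server-side.

```r
# Hypothetical stand-in for the REST call: takes a SMILES string and
# returns a character vector of hits (here, one fake hit per character)
run_query_stub <- function(smiles) rep("hit", nchar(smiles))

# Run every query, recording the number of hits and the elapsed time
benchmark_queries <- function(queries, run_query) {
  rows <- lapply(queries, function(q) {
    elapsed <- system.time(hits <- run_query(q))[["elapsed"]]
    data.frame(query = q, nhits = length(hits), seconds = elapsed,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

res <- benchmark_queries(c("c1ccccc1", "CCO", "C1CCNCC1"), run_query_stub)
summary(res$seconds)
```

Passing the query function in as an argument lets the same loop be exercised with a stub before pointing it at a live server.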
The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.
The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops off with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) and network transfer times.
Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, … atoms) as there was just a single query structure with that number of atoms.
Overall, very impressive. And it’s something you can play with yourself.