Fingerprint Similarity Searches in MongoDB

A few of my recent projects have involved the use of MongoDB, primarily for the ease afforded by a schemaless environment. Sometime back I had investigated the use of MongoDB to store chemical structure data, though those efforts did not actually query structures per se; instead they queried for precomputed numeric or text properties. So my interest was piqued when I came across a post from Datablend that described how to use the aggregation framework to perform similarity searching using fingerprints. Specifically their approach employs an integer representation for fingerprints – these can represent bit positions or hash codes (for path based fingerprints). Another blog post indicates they are able to perform similarity searches over 30M molecules in milliseconds. So I was interested in seeing what type of performance I could get on a local installation, albeit with a smaller set of molecules. All the data and code to regenerate these results are available in the mongosim repository (you’ll need to unzip fp.txt for the loading and profiling scripts).

I extracted 1M compounds from ChEMBL v17 and used the CDK to evaluate the Signature fingerprint. This resulted in 993,620 fingerprints. These were loaded into MongoDB (v2.4.9) using the simple Python script

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pymongo, sys

client = pymongo.MongoClient()
db = client.sim
coll = db.compounds

x = open('fp.txt', 'r')
x.readline()
n = 0

docs = []
for line in x:
    n += 1

    if line.strip().find(" ") == -1: continue
    molregno, bits = line.strip().split(" ")
    bits = [int(x) for x in bits.split(",")]

    doc = {"molregno":molregno,
           "fp":bits,
           "fpcount":len(bits),
           "smi":""}
    docs.append(doc)
    if n % 5000 == 0:
        coll.insert(docs)
        docs = []

coll.create_index(['fpcount',pymongo.ASCENDING])

I then used the first 1000 fingerprints as queries – each time looking for the compounds in the database that exhibited a Tanimoto score greater than 0.9 with the query fingerprint. The aggregation pipeline is shown in profile.py and is pretty much the same as described in the Datablend post. I specifically implement the bounds described by Swamidass and Baldi (which I think Datablend also uses, but the reference seems wrong), allowing me to first filter on bit counts before doing the heavy lifting. All of this was run on a Macbook Pro with 16GB RAM and a single core.

The performance was surprisingly slow. Over a thousand queries, the median query time was 6332ms, with the 95th quantile query time being 7599ms. The Datablend post describing this approach indicated that it got them very good performance and their subsequent post about their Similr service indicates that they achieve millisecond query times on Pubchem sized (30M) collections. I assume there are memory tweaks along with sharding that could let one acheive this level of performance, but there don’t appear to be any details.

I should point out that NCATS has already released code to allow fast similarity search using an in-memory fingerprint index, that supports millisecond query times over Pubchem sized collections.

Leave a Reply

Your email address will not be published. Required fields are marked *