# So much to do, so little time

Trying to squeeze sense out of chemical data

## Fingerprint Similarity Searches in MongoDB

A few of my recent projects have involved the use of MongoDB, primarily for the ease afforded by a schemaless environment. Sometime back I had investigated the use of MongoDB to store chemical structure data, though those efforts did not actually query structures per se; instead they queried for precomputed numeric or text properties. So my interest was piqued when I came across a post from Datablend that described how to use the aggregation framework to perform similarity searching using fingerprints. Specifically their approach employs an integer representation for fingerprints – these can represent bit positions or hash codes (for path based fingerprints). Another blog post indicates they are able to perform similarity searches over 30M molecules in milliseconds. So I was interested in seeing what type of performance I could get on a local installation, albeit with a smaller set of molecules. All the data and code to regenerate these results are available in the mongosim repository (you’ll need to unzip fp.txt for the loading and profiling scripts).

I extracted 1M compounds from ChEMBL v17 and used the CDK to evaluate the Signature fingerprint. This resulted in 993,620 fingerprints. These were loaded into MongoDB (v2.4.9) using the simple Python script

 12345678910111213141516171819202122232425262728 import pymongo, sys client = pymongo.MongoClient() db = client.sim coll = db.compounds x = open('fp.txt', 'r') x.readline() n = 0 docs = [] for line in x:     n += 1     if line.strip().find(" ") == -1: continue     molregno, bits = line.strip().split(" ")     bits = [int(x) for x in bits.split(",")]     doc = {"molregno":molregno,            "fp":bits,            "fpcount":len(bits),            "smi":""}     docs.append(doc)     if n % 5000 == 0:         coll.insert(docs)         docs = [] coll.create_index(['fpcount',pymongo.ASCENDING])

I then used the first 1000 fingerprints as queries – each time looking for the compounds in the database that exhibited a Tanimoto score greater than 0.9 with the query fingerprint. The aggregation pipeline is shown in profile.py and is pretty much the same as described in the Datablend post. I specifically implement the bounds described by Swamidass and Baldi (which I think Datablend also uses, but the reference seems wrong), allowing me to first filter on bit counts before doing the heavy lifting. All of this was run on a Macbook Pro with 16GB RAM and a single core.

The performance was surprisingly slow. Over a thousand queries, the median query time was 6332ms, with the 95th quantile query time being 7599ms. The Datablend post describing this approach indicated that it got them very good performance and their subsequent post about their Similr service indicates that they achieve millisecond query times on Pubchem sized (30M) collections. I assume there are memory tweaks along with sharding that could let one acheive this level of performance, but there don’t appear to be any details.

I should point out that NCATS has already released code to allow fast similarity search using an in-memory fingerprint index, that supports millisecond query times over Pubchem sized collections.

Written by Rajarshi Guha

July 23rd, 2014 at 2:44 pm

## High Content Screens and Multivariate Z’

While contributing to a book chapter on high content screening I came across the problem of characterizing screen quality. In a traditional assay development scenario the Z factor (or Z’) is used as one of the measures of assay performance (using the positive and negative control samples). The definition of Z’ is based on a 1-D readout, which is the case with most non-high content screens. But what happens when we have to deal with 10 or 20 readouts, which can commonly occur in a high content screen?

Assuming one has identified a small set of biologically relevant phenotypic parameters (from the tens or hundreds spit out by HCA software), it makes sense that one measure the assay performance in terms of the overall biology, rather than one specific aspect of the biology. In other words, a useful performance measure should be able to take into account multiple (preferably orthogonal) readouts. In fact, in many high content screening assays, the use of the traditional Z’ with a single readout leads to very low values suggesting a poor quality assay, when in fact, that is not the case if one were to consider the overall biology.

One approach that has been described in the literature is an extension of the Z’, termed the multivariate Z’. The approach was first described by Kummel et al, which develops an LDA model, trained on the positive and negative wells. Each well is described by N phenotypic parameters and the assumption is that one has pre-selected these parameters to be meaningful and relevant. The key to using the model for a Z’ calculation is to replace the N-dimensional values for a given well by the 1-dimensional linear projection of that well:

$P_i = \sum_{j=1}^{D} w_j x_{ij}$

where $P_i$ is the 1-D projected value, $w_j$ is the weight for the $j$‘th pheontypic parameter and $x_{ij}$ is the value of the $j$‘th parameter for the $i$‘th well.

The projected value is then used in the Z’ calculation as usual. Kummel et al showed that this approach leads to better (i.e., higher) Z’ values compared to the use of univariate Z’. Subsequently, Kozak & Csucs extended this approach and used a kernel method to project the N-dimensional well values in a non-linear manner. Unsurprisingly, they show a better Z’ than what would be obtained via a linear projection.

And this is where I have my beef with these methods. In fact, a number of beefs:

• These methods are model based and so can suffer from over-fitting. No checks were made and if over-fitting were to occur one would obtain a falsely optimistic Z’
• These methods assert success when they perform better than a univariate Z’ or when a non-linear projection does better than a linear projection. But neither comparison is a true indication that they have captured the assay performance in an absolute sense. In other words, what is the “ground truth” that one should be aiming for, when developing multivariate Z’ methods? Given that the upper bound of Z’ is 1.0, one can imagine developing methods that give you increasing Z’ values – but does a method that gives Z’ close to 1 really mean a better assay?  It seems that published efforts are measured relative to other implementations and not necessarily to an actual assay quality (however that is characterized).
• While the fundamental idea of separation of positive and negative control reponses as a measure of assay performance is good, methods that are based on learning this separation are at risk of generating overly optimistic assesments of performance.

## A counter-example

As an example, I looked at a recent high content siRNA screen we ran that had 104 parameters associated with it. The first figure shows the Z’ calculated using each layer individually (excluding layers with abnormally low Z’)

As you can see, the highest Z’ is about 0.2. After removing those with no variation and members of correlated pairs I ended up with a set of 15 phenotypic parameters. If we compare the per-parameter distributions of the positive and negative control responses, we see very poor separation in all layers but one, as shown in the density plots below (the scales are all independent)

I then used these 15 parameters to build an LDA model and obtain a multivariate Z’ as described by Kummel et al. Now, the multivariate Z’ turns out to be 0.68, suggesting a well performing assay. I also performed MDS on the 15 parameter set to get lower dimensional (3D, 4D, 5D, 6D etc) datasets and performed the same calculation, leading to similar Z’ values (0.41 – 0.58)

But in fact, from the biological point of view, the assay performance was quite poor due to poor performance of the positive control (we haven’t found a good one yet). In practice then, the model based multivariate Z’ (at least as described by Kummel et al can be misleading. One could argue that I had not chosen an appropriate set of phenotypic parameters – but I checkout a variety of other subsets (though not exhaustively) and I got similar Z’ values.

## Alternatives

Of course, it’s easy to complain and while I haven’t worked out a rigorous alternative, the idea of describing the distance between multivariate distributions as a measure of assay performance (as opposed to learning the separation) allows us to attack the problem in a variety of ways. There is a nice discussion on StackExchange regarding this exact question. Some possibilities include

It might be useful to perform a more comprehensive investigation of these methods as a way to measure assay performance

Written by Rajarshi Guha

September 9th, 2012 at 8:03 pm

## Cheminformatics and Non-Relational Datastores

Over the past year or so I’ve been seeing a variety of non-relational data stores coming up. They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMS’s in that they do not require explicit schema defined a priori. While they do not offer transactional guarantees (ACID) compared to RDBMS’s, they claim flexibility, speed and scalability. Examples include CouchDB, MongoDB and Tokyo Cabinet. Pierre and Brad have described some examples of using CouchDB with bioinformatics data and Rich has started a series on the use of CouchDB to store PubChem data.

Having used RDBMS’s such as PostgreSQL and Oracle for some time, I’ve wondered how or why one might use these systems for cheminformatics applications. Rich’s posts describe how one might go about using CouchDB to store SD files, but it wasn’t clear to me what advantage it provided over say, PostgreSQL.

I now realize that if you wanted to store arbitrary chemical data from multiple sources a document oriented database makes life significantly easier compared to a traditional RDBMS. While Rich’s post considers SD files from PubChem (which will have the same set of SD tags), CouchDB and its ilk become really useful when one considers, say, SD files from arbitrary sources. Thus, if one were designing a chemical registration system, the core would involve storing structures and an associated identifier. However, if the compounds came with arbitrary fields attached to them, how can we easily and efficiently store them? It’s certainly doable via SQL (put each field name into ‘dictionary’ table etc) but it seems a little hacky.

On the other hand, one could trivially transform an SD formatted structure to a JSON-like document and then dump that into CouchDB. In other words, one need not worry about updating a schema. Things become more interesting when storing associated non-structural data – assays, spectra and so on. When I initially set up the IU PubChem mirror, it was tricky to store all the bioassay data since the schema for assays was not necessarily identical. But I now see that such a scenario is perfect for a document oriented database.

However some questions still remain. Most fundamentally, how does not having a schema affect query performance? Thus if I were to dump all compounds in PubChem into CouchDB, pulling out details for a given compound ID should be very fast. But what if I wanted to retrieve compounds with a molecular weight less than 250? In a traditional RDBMS, the molecular weight would be a column, preferably with an index. So such queries would be fast. But if the molecular weight is just a document property, it’s not clear that such a query would (or could) be very fast in a document oriented DB (would it require linear scans?). I note that I haven’t RTFM so I’d be happy to be corrected!

However I’d expect that substructure search performance wouldn’t differ much between the two types of database systems. In fact, with the map/reduce features of CouchDB and MongoDB, such searches could in fact be significantly faster (though Oracle is capable of parallel queries).This also leads to the interesting topic of how one would integrate cheminformatics capabilities into a document-oriented DB (akin to a cheminformatics cartridge for an RDBMS).

So it looks like I’m going to have to play around and see how all this works.

Written by Rajarshi Guha

February 4th, 2010 at 5:51 am

## Performance of AtomContainerSet versus ArrayList

In my previous post, I had asked whether we really need AtomContainerSet and other related specialized container classes as opposed to using parametrized List objects. Gilleain had mentioned some issues that might require these specialized classes. But at this point it’s not clear to me what the intended goal for these classes was.

For now, assuming that they are used purely as storage classes, I decided to do some performance testing, comparing AtomContainerSet to ArrayList<IAtomContainer>. Currently these results are based on System.currentTimeMillis() as my YourKit licence has expired. So the results are not as fine grained as I’d like.

To perform the tests I read in 2000 3D structures taken from Pub3D and loaded them into an IAtomContainer[]. I then considered four operations

• Given an empty AtomContainerSet, we simply loop over the 2000-element array and add each molecule to the empty container using addAtomContainer. We do the same thing with an empty ArrayList<IAtomContainer>, but use add. The operation is repeated 100 times and the mean time for the addition of 2000 molecules to the empty containers is taken.
• Randomly access elements of the AtomContainerSet and ArrayList, 50,000 times
• Access each element serially (using for (IAtomContainer x : container) idiom). Averaged 100 times
• Remove 500 molecules by reference, randomly from each container structure
• Remove 500 molecules by index, randomly from each container structure

The summary view of the results is shown below. For each pair of bars, the time taken for the operation on the AtomContainerSet is normalized to 1.0 and the time with the ArrayList is appropriately converted.

The raw timings are shown below:

 Operation AtomContainerSet (sec) ArrayList (sec) Add 0.0036 0.0001 Indexed Random Access 0.006 0.007 Indexed Serially 0.007 0.012 Remove (by reference) 68.119 0.1 Remove (by index) 0.695 0.0

As you can see, the ArrayList is significantly faster in all except random and serial access. In that case, if you consider the raw numbers, the run times are essentially equivalent for random access, though serial access of an array in AtomContainerSet gives it an edge over ArrayList.

So if AtomContainerSet and related classes are purely for storage purposes, they could easily be replaced by ArrayList’s. However the fact that they include support for notifications suggests somebody is using it for more complex scenarios. In either case, a discussion of the design and role of these classes would be useful

Written by Rajarshi Guha

April 15th, 2009 at 5:27 pm