Cheminformatics and Non-Relational Datastores

Over the past year or so I’ve been seeing a variety of non-relational data stores coming up. They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMS’s in that they do not require explicit schema defined a priori. While they do not offer transactional guarantees (ACID) compared to RDBMS’s, they claim flexibility, speed and scalability. Examples include CouchDB, MongoDB and Tokyo Cabinet. Pierre and Brad have described some examples of using CouchDB with bioinformatics data and Rich has started a series on the use of CouchDB to store PubChem data.

Having used RDBMS’s such as PostgreSQL and Oracle for some time, I’ve wondered how or why one might use these systems for cheminformatics applications. Rich’s posts describe how one might go about using CouchDB to store SD files, but it wasn’t clear to me what advantage it provided over say, PostgreSQL.

I now realize that if you wanted to store arbitrary chemical data from multiple sources a document oriented database makes life significantly easier compared to a traditional RDBMS. While Rich’s post considers SD files from PubChem (which will have the same set of SD tags), CouchDB and its ilk become really useful when one considers, say, SD files from arbitrary sources. Thus, if one were designing a chemical registration system, the core would involve storing structures and an associated identifier. However, if the compounds came with arbitrary fields attached to them, how can we easily and efficiently store them? It’s certainly doable via SQL (put each field name into ‘dictionary’ table etc) but it seems a little hacky.

On the other hand, one could trivially transform an SD formatted structure to a JSON-like document and then dump that into CouchDB. In other words, one need not worry about updating a schema. Things become more interesting when storing associated non-structural data – assays, spectra and so on. When I initially set up the IU PubChem mirror, it was tricky to store all the bioassay data since the schema for assays was not necessarily identical. But I now see that such a scenario is perfect for a document oriented database.

However some questions still remain. Most fundamentally, how does not having a schema affect query performance? Thus if I were to dump all compounds in PubChem into CouchDB, pulling out details for a given compound ID should be very fast. But what if I wanted to retrieve compounds with a molecular weight less than 250? In a traditional RDBMS, the molecular weight would be a column, preferably with an index. So such queries would be fast. But if the molecular weight is just a document property, it’s not clear that such a query would (or could) be very fast in a document oriented DB (would it require linear scans?). I note that I haven’t RTFM so I’d be happy to be corrected!

However I’d expect that substructure search performance wouldn’t differ much between the two types of database systems. In fact, with the map/reduce features of CouchDB and MongoDB, such searches could in fact be significantly faster (though Oracle is capable of parallel queries).This also leads to the interesting topic of how one would integrate cheminformatics capabilities into a document-oriented DB (akin to a cheminformatics cartridge for an RDBMS).

So it looks like I’m going to have to play around and see how all this works.

9 thoughts on “Cheminformatics and Non-Relational Datastores

  1. Neil says:

    I’m working with mongodb at the moment (microarray data). My initial impression is that writes are very fast, queries can be slow. However, this is because neither my database nor my queries are optimized. I believe that without indexing, linear scans are required (and are slow), but indexes make everything much better. Map-reduce also helps, but you need sharding (splitting database across nodes) to get the benefit.

    There are also some excellent ORMs (or rather ODMs – object document mappers). I like mongoid (mongoid.org), which has a great DSL for powerful queries.

    I highly recommend mongodb, there is a lot of good documentation at their website on getting the most from it and easing the transition from relational.

  2. Andrew Perry says:

    In regards to retrieval speed for various queries (eg, MW < 250 Da) – in my understanding these key/value pair databases usually use some type of precalculated ‘index’, updated at document insertion time, that covers queries that you expect to make. Queries that use the index are fast .. arbitrary queries not anticipated with a precalculated index are slow. In Google App Engine these are called the ‘index’, and AFAIK in CouchDB these are called ‘views’.

  3. Thanks for the pointers. Happy to see that indexing is an integral feature

  4. Neil says:

    Andrew is right – for mongodb, the “_id” key is indexed by default, so query by document ID is fast (and _id is always returned by all queries). Any other key, you add it yourself. Composite keys, no problem.

  5. Ernst-Georg Schmid says:

    I’m currently playing around with Prevayler http://www.prevayler.org/ combined with MX or the CDK for handling chemical data.

    At the moment I can say that it works as expected at least with smaller datasets (~20000 structures), is easier to program (only Java, no SQL) and a lot (10x-100x) faster when searching.

    The main drawback is that you need enough memory to keep all your business objects in RAM and must implements efficient query strategies yourself. No optimizer will help you in Prevaylerland.

  6. Excellent thoughts ! I am curious seeing some performance for a couple of million documents. Besides, in which format are the documents typically stored? My last experience with hundred thousands of text files caused a server crash due to file system indexing saturation. On the other hand, if files are getting bundled in zip file, then indexing (direct jumps to data entries) is breaking down. Any experience with scalability and the number of possible documents?

  7. Ernst-Georg Schmid says:

    One thing to consider when deciding between RDBMS and a no-SQL database is reporting. Especially in corporate environments.

    1.) Most off the shelf reporting tools expect an SQL interface. Period.

    2.) While key-value stores and the like can compete or outrun RDBMS in straight searches or transactions, I wonder what happens when you need complex joins, projections?

    It’s all raw power and no finesse. Often raw power is just enough. But often you need some finesse also.

  8. […] my previous post I had mentioned that key/value or non-relational data stores could be useful in certain […]

  9. Neil says:

    Joerg, documents in MongoDB are stored in a format named BSON, or “binary JSON”. It’s a binary representation of a JSON data structure, so is optimized and fast (and limited to 4MB per document). No indexed/zipped text files!

Leave a Reply

Your email address will not be published. Required fields are marked *