Over the past year or so I’ve been seeing a variety of non-relational data stores coming up. They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMS’s in that they do not require explicit schema defined a priori. While they do not offer transactional guarantees (ACID) compared to RDBMS’s, they claim flexibility, speed and scalability. Examples include CouchDB, MongoDB and Tokyo Cabinet. Pierre and Brad have described some examples of using CouchDB with bioinformatics data and Rich has started a series on the use of CouchDB to store PubChem data.
Having used RDBMS’s such as PostgreSQL and Oracle for some time, I’ve wondered how or why one might use these systems for cheminformatics applications. Rich’s posts describe how one might go about using CouchDB to store SD files, but it wasn’t clear to me what advantage it provided over say, PostgreSQL.
I now realize that if you wanted to store arbitrary chemical data from multiple sources a document oriented database makes life significantly easier compared to a traditional RDBMS. While Rich’s post considers SD files from PubChem (which will have the same set of SD tags), CouchDB and its ilk become really useful when one considers, say, SD files from arbitrary sources. Thus, if one were designing a chemical registration system, the core would involve storing structures and an associated identifier. However, if the compounds came with arbitrary fields attached to them, how can we easily and efficiently store them? It’s certainly doable via SQL (put each field name into ‘dictionary’ table etc) but it seems a little hacky.
On the other hand, one could trivially transform an SD formatted structure to a JSON-like document and then dump that into CouchDB. In other words, one need not worry about updating a schema. Things become more interesting when storing associated non-structural data – assays, spectra and so on. When I initially set up the IU PubChem mirror, it was tricky to store all the bioassay data since the schema for assays was not necessarily identical. But I now see that such a scenario is perfect for a document oriented database.
However some questions still remain. Most fundamentally, how does not having a schema affect query performance? Thus if I were to dump all compounds in PubChem into CouchDB, pulling out details for a given compound ID should be very fast. But what if I wanted to retrieve compounds with a molecular weight less than 250? In a traditional RDBMS, the molecular weight would be a column, preferably with an index. So such queries would be fast. But if the molecular weight is just a document property, it’s not clear that such a query would (or could) be very fast in a document oriented DB (would it require linear scans?). I note that I haven’t RTFM so I’d be happy to be corrected!
However I’d expect that substructure search performance wouldn’t differ much between the two types of database systems. In fact, with the map/reduce features of CouchDB and MongoDB, such searches could in fact be significantly faster (though Oracle is capable of parallel queries).This also leads to the interesting topic of how one would integrate cheminformatics capabilities into a document-oriented DB (akin to a cheminformatics cartridge for an RDBMS).
So it looks like I’m going to have to play around and see how all this works.