Archive for the ‘database’ tag
Over the past year or so I’ve been seeing a variety of non-relational data stores coming up. They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMS’s in that they do not require explicit schema defined a priori. While they do not offer transactional guarantees (ACID) compared to RDBMS’s, they claim flexibility, speed and scalability. Examples include CouchDB, MongoDB and Tokyo Cabinet. Pierre and Brad have described some examples of using CouchDB with bioinformatics data and Rich has started a series on the use of CouchDB to store PubChem data.
Having used RDBMS’s such as PostgreSQL and Oracle for some time, I’ve wondered how or why one might use these systems for cheminformatics applications. Rich’s posts describe how one might go about using CouchDB to store SD files, but it wasn’t clear to me what advantage it provided over say, PostgreSQL.
I now realize that if you wanted to store arbitrary chemical data from multiple sources a document oriented database makes life significantly easier compared to a traditional RDBMS. While Rich’s post considers SD files from PubChem (which will have the same set of SD tags), CouchDB and its ilk become really useful when one considers, say, SD files from arbitrary sources. Thus, if one were designing a chemical registration system, the core would involve storing structures and an associated identifier. However, if the compounds came with arbitrary fields attached to them, how can we easily and efficiently store them? It’s certainly doable via SQL (put each field name into ‘dictionary’ table etc) but it seems a little hacky.
On the other hand, one could trivially transform an SD formatted structure to a JSON-like document and then dump that into CouchDB. In other words, one need not worry about updating a schema. Things become more interesting when storing associated non-structural data – assays, spectra and so on. When I initially set up the IU PubChem mirror, it was tricky to store all the bioassay data since the schema for assays was not necessarily identical. But I now see that such a scenario is perfect for a document oriented database.
However some questions still remain. Most fundamentally, how does not having a schema affect query performance? Thus if I were to dump all compounds in PubChem into CouchDB, pulling out details for a given compound ID should be very fast. But what if I wanted to retrieve compounds with a molecular weight less than 250? In a traditional RDBMS, the molecular weight would be a column, preferably with an index. So such queries would be fast. But if the molecular weight is just a document property, it’s not clear that such a query would (or could) be very fast in a document oriented DB (would it require linear scans?). I note that I haven’t RTFM so I’d be happy to be corrected!
However I’d expect that substructure search performance wouldn’t differ much between the two types of database systems. In fact, with the map/reduce features of CouchDB and MongoDB, such searches could in fact be significantly faster (though Oracle is capable of parallel queries).This also leads to the interesting topic of how one would integrate cheminformatics capabilities into a document-oriented DB (akin to a cheminformatics cartridge for an RDBMS).
So it looks like I’m going to have to play around and see how all this works.
Sometime back I had mentioned a new cheminformatics toolkit, Indigo. Recently, Dmitry from SciTouch let me know that they had also developed Bingo, an Oracle cartridge based on Indigo, to perform cheminformatics operations in the database. This expands the current ecosystem of Open Source database cartridges (PGChem, MyChem, OrChem) which pretty much covers all the main RDBMSs (Postgres, MyQSL and Oracle). SciTouch have also provided a live instance of their database and associated cartridge, so you can play with it without requiring a local Oracle install. (It’d be useful to provide some details of the hardware that the DB is running on, so that timing numbers get some context)
Some handy settings when running a query from the command line via sqlplus
set echo off set heading on set linesize 1024 set pagesize 0 set tab on set trims on set wrap off
-- might want to set column formats here
-- e.g.: column foo format A10
spool stats -- dump results to stats.lst
-- SQL query here
It’s been a while since my last post, but I’m getting up to speed at work. It’s been less than a month, but there’s already a ton of cool stuff going on. One of the first things I’ve been getting to grips with is the data infrastructure at the NCGC, which is based around Oracle. One of my main projects is handling informatics for RNAi screening. As the data comes out of the pilots, they get loaded into the Oracle infrastructure.
Being an R aficionado, I’m doing the initial, exploratory analyses (normalization, hit selection, annotation etc.) using R. Thus I needed to have a way to access an Oracle DB from R. This is supported by the ROracle package. But it turns out that the installation is a little non-obvious and I figured I’d describe the procedure (on OS X 10.5) for posterity.
The first thing to do is to get Oracle from here. Note that this is the full Oracle installation and while it comes with 32 bit and 64 bit libraries, some of the binaries that are required during the R install are 64 bit only. After getting the zip file, extract the installation files and run the installation script. Since I just needed the libraries (as opposed to running an actual Oracle DB), I just went with the defaults and opted out of the the actual DB creation step. After installation is done, it’s useful to set the following environment variables:
With Oracle installed, execute the following
This will link a variety of object files into a library, which is required by the R package, but doesn’t come in the default Oracle installation.
The next thing is to get a 64 bit version of R from here and simply install as usual. Note that this will require you to reinstall all your packages, if you had a previous version of R around. Specifically, before installing ROracle, make sure to install the DBI package.
After installing R, get the ROracle 0.5-9 source package. Since there’s no binary build for OS X, we have to compile it ourselves. Before building, I like to CHECK the package to make sure that all is OK. Thus, the sequence of commands is
tar -zxvf ROracle_0.5-9.tar.gz
R --arch x86_64 CMD CHECK ROracle
R --arch x86_64 CMD BUILD ROracle
R --arch x86_64 CMD INSTALL -l /Users/guhar/Library/R/2.9/library ROracle_0.5-9.tar.gz
When I ran the CHECK, I did get some warnings, but it seems to be safe to ignore them.
At this stage, the ROracle package should be installed and you can start R and load the package. Remember to start R with the –arch x86_64 argument, since the ROracle package will have been built for the 64 bit version of R.
There’s been an interesting discussion sparked by Deepaks post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.
At a high level one can easily point out factors such as licensing costs for the tools to do chemistry in the cloud, lack of standards in data sets and formats and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involves sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested in seeing that Lilly seems to be making a move towards an Amazon based cloud infrastructure.
But when I read Deepaks post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?
While things like molecular dynamics are not going to run too well on a cloud set up, problems that are data parallel can make excellent use of such a set up. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers talking about the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning etc) would be another application.
But the problem I see facing these efforts is that they tend to be project specific. In contrast doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. On the other hand, each docking project is different, in terms of receptor (though there’s less variation) and ligand libraries. So on the chemistry side, the input is much larger and more variable.
Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?
Following on from this, one could ask, are chemistry datasets really that large? I’d say, no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3 million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not big enough that I can’t handle it.
So this leads to the question: what is big enough to make use of the cloud?
What about really large structure databases? Say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand a community oriented database like ChemSpider could take on this role – it already contains PubChem, so one could consider groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.
So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include
- Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
- Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
- Many chemistry problems are not large in terms of data size, compared to other life science applications
- Cheminformatics is a much smaller community than bioinformatics, though is applies mainly to non-corporate settings (where the reverse is likely true)
Though I haven’t explicitly talked about the tools – that certainly plays a factor. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one factor that will need to be addressed is the vendors coming on board and supporting cloud style setups.