BAZOO

So much to do, so little time

Trying to squeeze sense out of chemical data

Chemistry, Clouds, Collaboration (Part 1)

There’s been an interesting discussion sparked by Deepak’s post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.

At a high level, one can easily point to factors such as licensing costs for the tools to do chemistry in the cloud, a lack of standards in data sets and formats, and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involve sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested to see that Lilly seems to be making a move towards an Amazon-based cloud infrastructure.

But when I read Deepak’s post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?

While things like molecular dynamics are not going to run too well on a cloud setup, problems that are data parallel can make excellent use of one. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers on the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning, etc.) would be another application.
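
To make "data parallel" concrete, here is a minimal Python sketch of the pattern: split a ligand library into independent chunks and score each chunk separately. Everything in it is an assumption for illustration. `score_ligand` is a hypothetical placeholder (it just returns the length of the SMILES string, not a docking score), and a local thread pool stands in for what would be separate cloud machines in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, n_chunks):
    """Split items into n_chunks contiguous slices of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def score_ligand(smiles):
    """Hypothetical stand-in for a docking run: returns the string length."""
    return len(smiles)


def score_chunk(ligands):
    """Score one chunk; chunks share no state, so they can run anywhere."""
    return [(lig, score_ligand(lig)) for lig in ligands]


def screen(ligands, n_workers=4):
    """Score every ligand, one chunk per worker, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(score_chunk, chunk(ligands, n_workers)))
    return [pair for part in parts for pair in part]
```

The point of the sketch is that no chunk needs to know about any other, which is what makes docking and virtual screening a natural fit for a pile of loosely connected machines, unlike molecular dynamics.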

But the problem I see facing these efforts is that they tend to be project specific. In contrast, doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. Each docking project, on the other hand, is different in terms of the receptor (though there’s less variation there) and the ligand libraries. So on the chemistry side, the input is much larger and more variable.

Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?
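
For readers less familiar with similarity searching, the following is a rough Python sketch of the usual idea: compare a query fingerprint against every fingerprint in a collection using the Tanimoto coefficient. The fingerprints here are assumed to be plain sets of "on" bit positions, and the collection and threshold are made up for illustration; a real search would use a cheminformatics toolkit to generate the fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def search(query, collection, threshold=0.7):
    """Return (name, similarity) pairs at or above threshold, best first.

    `collection` maps a compound name to its fingerprint (a set of on-bits).
    """
    hits = [(name, tanimoto(query, fp)) for name, fp in collection.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])


# Toy collection with hypothetical fingerprints
collection = {"a": {1, 2, 3}, "b": {1, 2, 3, 4}, "c": {9}}
hits = search({1, 2, 3}, collection)
```

Each comparison is independent of the others, so the search itself parallelizes trivially; the practical obstacle is that the collection has to be sitting in the cloud already.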

Following on from this, one could ask: are chemistry datasets really that large? I’d say no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3 million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not so big that I can’t handle it myself.
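
As a back-of-the-envelope check on those numbers, here is a small Python sketch. The docking figures (500,000 compounds, 1.5 days, 8 machines, 80GB) come from the example above; the assumption of perfectly linear scaling, the 64-machine scenario and the 100 Mbit/s upload speed are mine and purely illustrative.

```python
def throughput_per_machine_hour(compounds, days, machines):
    """Compounds docked per machine per hour."""
    return compounds / (days * 24 * machines)


def wall_days(compounds, rate, machines):
    """Wall-clock days to dock `compounds`, assuming perfectly linear scaling."""
    return compounds / (rate * machines) / 24


def transfer_hours(gigabytes, megabits_per_second):
    """Hours to move the data over a link of the given speed (1 GB = 8000 Mbit)."""
    return gigabytes * 8 * 1000 / megabits_per_second / 3600


# Figures from the Ugi docking run described above
rate = throughput_per_machine_hour(500_000, 1.5, 8)

# Hypothetical scenarios: 64 cloud machines, 100 Mbit/s upload link
days_on_64 = wall_days(500_000, rate, 64)
upload_hours = transfer_hours(80, 100)
```

Under these assumptions the run works out to roughly 1,700 compounds per machine-hour, the 64-machine version would finish in under five hours, and uploading the 80GB would take a little under two hours. So the data transfer is an annoyance rather than a blocker, and the local run was never painful enough to force a move.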

So this leads to the question: what is big enough to make use of the cloud?

What about really large structure databases, say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand, a community-oriented database like ChemSpider could take on this role – it already contains PubChem, so one could imagine groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.

So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include:

  • Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
  • Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
  • Many chemistry problems are not large in terms of data size, compared to other life science applications
  • Cheminformatics is a much smaller community than bioinformatics, though this applies mainly to non-corporate settings (in corporate settings the reverse is likely true)

I haven’t explicitly talked about the tools, but they are certainly a factor. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one thing that will need to happen is for the vendors to come on board and support cloud-style setups.

Written by Rajarshi Guha

February 22nd, 2009 at 5:00 pm

4 Responses to 'Chemistry, Clouds, Collaboration (Part 1)'

  1. Interesting discussion. A few weeks ago, a company called Cirrhus9 (http://www.cirrhus9.com/) held a presentation on cloud computing and pharma in San Diego (I had planned on going but wasn’t able to make it). Apparently, the company will be holding similar events every several weeks around the city.

    I talked to a representative of Cirrhus9 who had some interesting ideas about the ways pharma companies could cut costs by moving computationally-intensive calculation packages onto the cloud (vs. hosting on their own clusters).

    IMO, cost-cutting will drive a lot of the initial demand for cloud resources as companies of all sizes look for ways to save money any way they can.

    So it may not only be a case of identifying what’s big enough for the cloud, but also what’s expensive enough (in total cost) to be moved into the cloud.

    Rich Apodaca

    22 Feb 09 at 5:20 pm

  2. [...] my previous post I talked mainly about why there isn’t a large showing of chemistry in the cloud. It was based [...]

  3. Rich, yes, that’s a very good point. Given that a cloud is simply someone else managing the hardware, I think the cost issues will be a major factor in the decision to move or not move to a cloud setup.

    Rajarshi Guha

    22 Feb 09 at 5:38 pm

  4. [...] led to a post on Friendfeed. That’s becoming a fairly active discussion and led to a couple of posts by [...]
