There’s been an interesting discussion sparked by Deepaks post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.
At a high level one can easily point out factors such as licensing costs for the tools to do chemistry in the cloud, lack of standards in data sets and formats and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involves sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested in seeing that Lilly seems to be making a move towards an Amazon based cloud infrastructure.
But when I read Deepaks post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?
While things like molecular dynamics are not going to run too well on a cloud set up, problems that are data parallel can make excellent use of such a set up. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers talking about the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning etc) would be another application.
But the problem I see facing these efforts is that they tend to be project specific. In contrast doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. On the other hand, each docking project is different, in terms of receptor (though there’s less variation) and ligand libraries. So on the chemistry side, the input is much larger and more variable.
Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?
Following on from this, one could ask, are chemistry datasets really that large? I’d say, no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3 million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not big enough that I can’t handle it.
So this leads to the question: what is big enough to make use of the cloud?
What about really large structure databases? Say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand a community oriented database like ChemSpider could take on this role – it already contains PubChem, so one could consider groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.
So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include
- Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
- Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
- Many chemistry problems are not large in terms of data size, compared to other life science applications
- Cheminformatics is a much smaller community than bioinformatics, though is applies mainly to non-corporate settings (where the reverse is likely true)
Though I haven’t explicitly talked about the tools – that certainly plays a factor. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one factor that will need to be addressed is the vendors coming on board and supporting cloud style setups.