Archive for the ‘chemistry’ tag
Rich Apodaca recently wrote a post highlighting StackOverflow – a community discussion site for software development – suggesting that a similar type of site for chemists would not work. He also posted a follow-up listing some factors that make something like StackOverflow unlikely for the chemistry community. I made a quick comment noting that one difference between the cultures of the chemistry and software communities was the possibility of commercialization. On thinking about it a little, this is not entirely correct, as both communities generate ideas and work that lead to commercialization.
But I think that the difference lies in the nature of the commercialization process. As Rich pointed out in his followup post, entrepreneurship and resources are two important sources of differences between the chemistry and software communities. In the latter community, two people can implement an idea with minimal resource investment and end up with a profitable product. In contrast, two chemists might come up with an idea, but in many cases, it will require significant investment in resources to get an initial product (and scale up would be a separate issue).
In that sense, commercialization in chemistry can be a longer process – and if that’s the case, it’s not surprising that we see these differences. In fact, if we’re comparing chemistry to some computer-related field, a comparison with computer hardware seems more appropriate than one with computer software, especially when we consider the costs involved in the commercialization process. (Though with FPGAs and chip fabs, computer hardware startups are probably still easier than a chemistry startup.)
Another factor that differentiates chemistry from computer software or hardware is that chemistry projects are not usually spare-time projects. One can write software or design (basic) hardware as a spare-time thing which, if it turns out to be feasible/useful/interesting, can be transformed into an actual product. Again, this goes back to the costs involved in testing out and implementing new ideas without institutional backing.
Rich’s other points are also good, and I think his comments on patents vs. copyrights are especially important. However, I’m not so sure about the issue of history – obviously, history brings tradition (baggage?), but is this really a big factor? It seems that the implications of history overlap to a large degree with “established communication channels”.
There’s been an interesting discussion sparked by Deepak’s post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.
At a high level one can easily point out factors such as licensing costs for the tools to do chemistry in the cloud, lack of standards in data sets and formats, and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involve sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested to see that Lilly seems to be making a move towards an Amazon-based cloud infrastructure.
But when I read Deepak’s post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?
While things like molecular dynamics are not going to run too well on a cloud setup (the nodes must communicate at every timestep, and clouds generally lack the low-latency interconnects that kind of tight coupling demands), problems that are data parallel can make excellent use of such a setup. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers discussing the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning etc.) would be another application.
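What makes these problems data parallel is that the ligand library can be split into chunks that are scored completely independently. A minimal sketch of that structure, where `score_ligand` is a hypothetical stand-in for a real docking or machine-learning scoring step:

```python
# Sketch of a data-parallel virtual screen. The library is split into
# independent chunks; on a cloud setup each chunk would be shipped to
# its own node, since no chunk depends on another. score_ligand is a
# hypothetical placeholder for a real docking run or model prediction.

def score_ligand(smiles):
    # Placeholder score based on string length, purely illustrative;
    # in practice this would be the expensive per-ligand computation.
    return len(smiles) / 10.0

def chunk(items, n):
    """Split items into n roughly equal, independent work units."""
    size = (len(items) + n - 1) // n
    return [items[i:i + size] for i in range(0, len(items), size)]

def screen_chunk(ligands):
    # The unit of work a single worker node would execute.
    return [(smi, score_ligand(smi)) for smi in ligands]

library = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC", "O=C=O"]
chunks = chunk(library, 2)

# Mapped sequentially here; in the cloud each screen_chunk call would
# run remotely and only the (ligand, score) pairs would come back.
results = [hit for c in chunks for hit in screen_chunk(c)]
ranked = sorted(results, key=lambda t: t[1], reverse=True)
```

The point is simply that the merge step at the end is trivial – ranking a list of scores – so all the cost sits in the embarrassingly parallel middle, which is exactly the shape of problem a cloud handles well.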
But the problem I see facing these efforts is that they tend to be project specific. In contrast, doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. On the other hand, each docking project is different in terms of the receptor (though there’s less variation there) and the ligand libraries. So on the chemistry side, the input is much larger and more variable.
Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?
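To make the similarity-searching case concrete, here is a minimal sketch: molecules are represented as bit-set fingerprints (modeled here as plain Python sets of on-bit positions, a simplification of real hashed fingerprints), and hits against a stored collection are ranked by Tanimoto similarity. The database contents are toy values; the collection itself – PubChem, a corporate library – is the thing that would need to live in the cloud next to the compute.

```python
# Toy similarity search over a fingerprint collection. Fingerprints
# are sets of on-bit positions (a simplification of real hashed
# fingerprints); the database values are made up for illustration.

def tanimoto(fp1, fp2):
    """Tanimoto coefficient: |intersection| / |union| of on-bits."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

# The stored collection -- in the cloud scenario, this is the data
# that must be hosted alongside the compute for searching to be useful.
database = {
    "mol1": {1, 3, 5, 8},
    "mol2": {2, 4, 6, 8},
    "mol3": {1, 3, 9},
}

query = {1, 3, 5, 8}
hits = sorted(((mid, tanimoto(query, fp)) for mid, fp in database.items()),
              key=lambda t: t[1], reverse=True)
```

Each comparison is independent, so this too is data parallel – but only if the fingerprint collection is already sitting where the compute is, which is precisely the practical issue above.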
Following on from this, one could ask: are chemistry datasets really that large? I’d say no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3 million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not big enough that I can’t handle it.
So this leads to the question: what is big enough to make use of the cloud?
What about really large structure databases? Say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand a community oriented database like ChemSpider could take on this role – it already contains PubChem, so one could consider groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.
So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include
- Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
- Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
- Many chemistry problems are not large in terms of data size, compared to other life science applications
- Cheminformatics is a much smaller community than bioinformatics, though this applies mainly to non-corporate settings (where the reverse is likely true)
Though I haven’t explicitly talked about the tools, they certainly play a role. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one factor that will need to be addressed is getting the vendors on board to support cloud-style setups.
Houghten, R. et al., “Strategies for the Use of Mixture-Based Synthetic Combinatorial Libraries: Scaffold Ranking, Direct Testing In Vivo, and Enhanced Deconvolution by Computational Methods”, J. Comb. Chem., 2008, 10, 3–19
Recently a collaborator pointed me to the above article by Houghten and co-workers where they describe the use of mixture-based combinatorial libraries for high-throughput screening (HTS) experiments.
Traditionally, an HTS experiment will screen thousands to millions of individual molecules. Obviously, it’s all done by robots, so while you have to be careful during setup, it’s not as if you have to do it all by hand. But the fact is, if it’s possible to reduce the actual number of individual screens, life becomes easier and cheaper. Houghten et al. describe an elegant approach that does just this.