Archive for the ‘chemistry’ Category
In my previous post I talked mainly about why there isn’t a large showing of chemistry in the cloud. It was based on Deepak’s post and a FriendFeed thread, but really only addressed the first two words of the title. The issue of collaboration came up in the FriendFeed thread via some comments from Matthew Todd. He asked:
I am also interested in why there are so few distributed chemistry collaborations – i.e. those involving the actual synthesis of chemical compounds and their evaluation. Does it come down to data sharing tools?
The term “distributed chemistry collaborations” arises, in part, from a recent paper. But one might say that the idea of distributed collaborations is already here: chemists have been collaborating in a variety of ways, though many of these collaborations are small and focused (say, between two or three people).
I get the feeling that Matthew is talking about larger collaborations, something along the lines of the CombiUgi project or the ONS Challenge. I think there are a number of factors that might explain why we don’t see more such large, distributed chemistry collaborations.
First, there is the issue of IP and credit. How will it get apportioned? If each collaborator is providing a specific set of skills, I can see it being relatively simple. But then it also sounds like pretty much any current collaboration. What happens when multiple people are synthesizing different compounds? And when multiple people are running assays? How is the work divided? How is credit assigned? And are large, loosely managed groups even efficient? Of course, one could compare the scenario to many large Open Source projects and their management issues.
Second, I think data sharing tools are a factor. How do collaborations (especially those without an informatics component) efficiently share information? Probably Excel – but there are a number of efforts, such as CDD and ChemSpider, that are making it much easier for chemists to share chemical information.
A third factor, somewhat related to the previous point, is that academic chemistry has largely ignored the informatics aspects of chemistry (both as an infrastructure topic and as a research area). I think this is partly related to the scale of academic chemistry. Certainly, many topics in chemical research do not require informatics capabilities (compared to, say, ab initio computational capabilities). But there are a number of areas, such as the type that Matthew notes, that can greatly benefit from an efficient informatics infrastructure. I certainly won’t say that it’s all there and ready to use – but I think it’s important that cheminformatics plays a role. In this sense, one could say that there would be many more distributed collaborations if chemists knew that there was an infrastructure that could support their efforts. I will also note that it’s not just about infrastructure – while important, that part is pretty straightforward IT (given some domain knowledge). There is a lot more to cheminformatics than just setting up databases to support bench chemistry efforts. Industry realizes this; academia hasn’t so much (at least yet).
Which leads me to the fourth factor, which is social. Maybe the reason for the lack of such collaborations is that chemists just don’t have a good way of getting the word out that they are available and/or interested. Certainly, things like FriendFeed are a venue where this could happen, but given that most academic chemists are conservative, it may take time for this to pick up speed.
There’s been an interesting discussion sparked by Deepak’s post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.
At a high level one can easily point out factors such as licensing costs for the tools to do chemistry in the cloud, lack of standards in data sets and formats, and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involve sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested to see that Lilly seems to be making a move towards an Amazon-based cloud infrastructure.
But when I read Deepaks post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?
While things like molecular dynamics are not going to run too well on a cloud setup, problems that are data parallel can make excellent use of one. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers on the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning etc.) would be another application.
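The data-parallel nature of these problems is easy to sketch: each ligand is scored independently of the rest, so a library can simply be split across workers. A minimal sketch in Python, where `score_ligand` is a hypothetical stand-in for a real docking or scoring call (a dummy score keeps it self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def score_ligand(smiles):
    # Hypothetical stand-in for a real docking/scoring call;
    # we return the SMILES length as a dummy "score".
    return (smiles, float(len(smiles)))

def screen(library, n_workers=4):
    # Each ligand is scored independently, so the work is
    # embarrassingly parallel: the same map pattern scales from
    # local threads to a pool of cloud worker nodes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(score_ligand, library))

results = screen(["CCO", "c1ccccc1", "CC(=O)O"])
```

The same fan-out/collect pattern is what a cloud-based screening service would implement, just with machines instead of threads.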
But the problem I see facing these efforts is that they tend to be project specific. In contrast, doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. On the other hand, each docking project is different in terms of receptor (though there’s less variation there) and ligand libraries. So on the chemistry side, the input is much larger and more variable.
Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?
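The core of such a similarity search is simple enough to sketch. Below is a toy version in plain Python, with fingerprints represented as sets of “on” bit positions (real fingerprints would come from a cheminformatics toolkit, and the database names are made up):

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto coefficient on fingerprints represented as sets
    # of "on" bit positions: |A & B| / |A | B|.
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def similarity_search(query_fp, database, cutoff=0.5):
    # Return (name, score) pairs above the cutoff, best hit first.
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    return sorted((h for h in hits if h[1] >= cutoff),
                  key=lambda h: h[1], reverse=True)

# Toy fingerprints; a real collection would hold millions of these.
db = {
    "mol1": {1, 4, 9, 16},
    "mol2": {2, 3, 5, 7},
    "mol3": {1, 4, 9, 25},
}
hits = similarity_search({1, 4, 9, 16}, db)
```

The catch noted above is that `db` is the hard part: unless the public or corporate collection already lives in the cloud, the search can’t usefully run there.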
Following on from this, one could ask, are chemistry datasets really that large? I’d say, no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3 million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not so big that I can’t handle it myself.
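The numbers above support a quick back-of-envelope calculation of what the cloud would buy you, assuming a perfect, embarrassingly parallel speedup (optimistic, since it ignores the 80GB transfer and job setup overhead):

```python
# Back-of-envelope scaling for the docking run described above.
n_compounds = 500_000
wall_days, n_machines = 1.5, 8

# Total machine-time consumed by the run, in seconds.
machine_seconds = wall_days * 86_400 * n_machines

# Cost per compound, per machine (~2.1 s each).
sec_per_compound = machine_seconds / n_compounds

# Machines needed to finish the same run in one hour on a cloud,
# assuming perfect linear scaling.
machines_for_one_hour = machine_seconds / 3_600
```

So on the order of a few hundred cloud nodes would turn a 1.5-day run into an hour – which is exactly the kind of burst capacity a cloud is good for, if the data is already there.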
So this leads to the question: what is big enough to make use of the cloud?
What about really large structure databases? Say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand a community oriented database like ChemSpider could take on this role – it already contains PubChem, so one could consider groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.
So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include
- Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
- Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
- Many chemistry problems are not large in terms of data size, compared to other life science applications
- Cheminformatics is a much smaller community than bioinformatics, though this applies mainly to non-corporate settings (where the reverse is likely true)
Though I haven’t explicitly talked about the tools, they certainly play a role. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one factor that will need to be addressed is the vendors coming on board and supporting cloud-style setups.
News of the ChemSpider Journal of Chemistry has been posted in various places. This effort is interesting as it is a combination of features that are currently available in different forms. Like other Open Access journals, the CJC will follow the BOAI and hence be Open Access. In addition it will exhibit markup of the text, such as that done by the RSC journals (which are not OA). I’m especially interested in this latter feature for automated processing of articles. While it is good to see the combination of these features, it is also interesting to see that the journal will use a just-in-time (JIT) approach and allow online peer review and commentary. In this sense, it can be expected to be an especially good venue for ONS-style projects.
I think this effort will be an interesting experiment, especially given that many “traditional” chemists may not have blogs and wikis to support a JIT approach, and that a journal might be more acceptable to them. I recently joined the editorial board. I’m eager to see how the journal evolves and am pleased to be able to contribute to this effort – and I encourage others to do so as well.
In a previous post, I described a simple web form to query and visualize the solubility data being generated as part of the ONS Challenge. The previous approach required me to manually download the data and load it into a Postgres database. While trivial from a coding point of view, it’s a pain since I have to keep my local DB in sync with the Google Docs spreadsheet.
The new approach queries the Google Docs spreadsheet directly, which is very nice since I no longer have to maintain a local DB and ensure that it’s in sync with Jean-Claude’s results. Of course, there are some drawbacks to this method. First, the query page assumes that the data in the spreadsheet is clean. So if there are two entries called “Ethanol” and “ethanol”, they will be considered separate solvents. Second, this approach cannot be used to include cheminformatics in the queries, since Google doesn’t support that functionality. Finally, it’s not going to be very good for large spreadsheets.
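The query itself is just a URL against Google’s Visualization API, using its small SQL-like query language. A minimal sketch of building such a URL (the spreadsheet key here is hypothetical; the real one identifies the ONS solubility sheet, and the column letters would depend on its layout):

```python
from urllib.parse import urlencode

def gviz_query_url(key, query):
    # Build a Google Visualization API query URL for a publicly
    # shared spreadsheet; "tq" carries the SQL-like query string.
    base = "http://spreadsheets.google.com/tq"
    return base + "?" + urlencode({"key": key, "tq": query})

# Hypothetical key and columns, just to show the shape of a query.
url = gviz_query_url("abc123", "select A, B where B > 0.5")
```

Fetching that URL returns the matching rows, which is what lets the query page run against the live spreadsheet instead of a local copy.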
However, this is a very nice API that allows one to elegantly integrate web applications with live data. I heart Google!
The Curious Wavefunction has a nice post on the issue of selective and non-selective kinase inhibitors. An interesting commentary, especially in the light of the recent paper on network polypharmacology. While there have been a number of papers on polypharmacology and the idea itself is very attractive, it has seemed to me that for this approach to succeed we need very detailed information on the targets and systems involved in these networks. Indeed, one of my current projects is hitting exactly this problem. As Ashutosh notes,
… in the first place we don’t even know what specific subset of kinases to hit for treating a particular disease. First comes target validation, then modulation.