Archive for February, 2009
Being an R aficionado, I do the bulk of my work in R and having grown up with Emacs I tend to dislike having to exit my environment to do “other” stuff. This was the motivation for integrating R and the CDK, so that I could access and manipulate chemical information from within my R session. This resulted in the rcdk package.
Since then there have been a lot of improvements in the CDK, and so the latest version (2.9.2) of rcdk includes them and also provides access to much more of the CDK via R idioms. As the original J. Stat. Soft. paper is now pretty much deprecated, we have included a tutorial in the form of a vignette. The latest version of rcdk is now much smaller, since we have split the actual CDK libraries out into a separate package called rcdklibs. This allows us to release new versions of rcdk without requiring a bulky download each time, since rcdklibs should change at a slower pace. I’d also like to thank Miguel Rojas Cherto for his contributions to this version of rcdk (as well as to rpubchem).
So what can you do with rcdk? Installation is pretty simple – just point your favorite R interface at CRAN (or a mirror) and the package will be installed along with all its dependencies. After loading the library, you can read in any file format that the CDK supports, or parse a SMILES string directly:
mols <- load.molecules("mymols.sdf")
mol.smiles <- parse.smiles("CC(=O)Cc1cc(Cl)ccc1")
which gives you a list of molecule objects. Note that these objects are actually references to Java objects, and so you can’t serialize them via R’s save command. This is a pain, so I’m planning to implement some code generators that will create S4 classes directly from the Java class definitions.
Once you have a molecule object you can do a variety of things:
## view molecule depictions
view.molecule.2d(mols)
## evaluate fingerprints
fps <- get.fingerprints(mols, type="maccs")
## generate descriptors
dnames <- get.desc.names("topological")
descs <- eval.desc(mols, dnames)
One problem with the depiction code is that it does not work well on OS X. This is due to interactions between rJava and the R event handling loop. As a result, depictions show up, but you can’t then interact with the window. It does work fine on Linux and Windows. To handle fingerprints easily, I suggest using the fingerprint package. There are also methods to easily access atoms, bonds, molecule properties and so on.
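The core computation underlying fingerprint comparison is the Tanimoto coefficient, which is simple enough to sketch in plain R. The function below is illustrative only (it is not the fingerprint package’s API, which operates on its own fingerprint objects); fingerprints are represented here as plain logical vectors.

```r
# Tanimoto similarity between two fingerprints stored as logical vectors.
# Illustrative sketch only - the fingerprint package provides its own
# similarity/distance methods for the objects rcdk returns.
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both   <- sum(a & b)        # bits set in both fingerprints
  either <- sum(a | b)        # bits set in at least one
  if (either == 0) return(0)  # two empty fingerprints: define as 0
  both / either
}

f1 <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
f2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
tanimoto(f1, f2)  # 2 bits in common, 4 bits in either -> 0.5
```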
In my previous post I talked mainly about why there isn’t a large showing of chemistry in the cloud. It was based on Deepak’s post and a FriendFeed thread, but really only addressed the first two words of the title. The issue of collaboration came up in the FriendFeed thread via some comments from Matthew Todd. He asked:
I am also interested in why there are so few distributed chemistry collaborations – i.e. those involving the actual synthesis of chemical compounds and their evaluation. Does it come down to data sharing tools?
The term “distributed chemistry collaborations” arises, partly, from a recent paper. But one might say that the idea of distributed collaborations is already here. Chemists have been collaborating in a variety of ways, though many of these collaborations are small and focused (say, between two or three people).
I get the feeling that Matthew is talking about larger collaborations, something on the lines of the CombiUgi project or the ONS Challenge. I think there are a number of factors that might explain why we don’t see more such large, distributed chemistry collaborations.
First, there is the issue of IP and credit. How will it be apportioned? If each collaborator is providing a specific set of skills, I can see it being relatively simple. But then it also sounds like pretty much any current collaboration. What happens when multiple people are synthesizing different compounds? And when you have multiple people doing assays? How is work divided? How is credit assigned? And are large, loosely managed groups even efficient? Of course, one could compare the scenario to many large Open Source projects and their management issues.
Second, I think data sharing tools are a factor. How do collaborations (especially those without an informatics component) efficiently share information? Probably Excel – but there are a number of efforts such as CDD and ChemSpider which are making it much easier for chemists to share chemical information.
A third factor, somewhat related to the previous point, is that academic chemistry has largely ignored the informatics aspects of chemistry (both as an infrastructure topic and as a research area). I think this is partly related to the scale of academic chemistry. Certainly, many topics in chemical research do not require informatics capabilities (compared to, say, ab initio computational capabilities). But there are a number of areas, such as the type that Matthew notes, that can greatly benefit from an efficient informatics infrastructure. I certainly won’t say that it’s all there and ready to use – but I think it’s important that cheminformatics plays a role. In this sense, one could say that there would be many more distributed collaborations if chemists knew that there was an infrastructure that could help their efforts. I will also note that it’s not just about infrastructure – while important, infrastructure is also pretty straightforward IT (given some domain knowledge). There is a lot more to cheminformatics than just setting up databases that can support bench chemistry efforts. Industry realizes this. Academia hasn’t so much (at least yet).
Which leads me to the fourth factor, which is social. Maybe the reason for the lack of such collaborations is that chemists just don’t have a good way of getting the word out that they are available and/or interested. Certainly, things like FriendFeed are a venue for this sort of thing to happen, but given that most academic chemists are conservative, it may take time for this to pick up speed.
There’s been an interesting discussion sparked by Deepak’s post, asking why there is a much smaller showing of chemists and chemistry applications in the cloud compared to other life science areas. This post led to a FriendFeed thread that raised a number of issues.
At a high level one can easily point out factors such as licensing costs for the tools to do chemistry in the cloud, lack of standards in data sets and formats, and so on. As Joerg pointed out in the FF thread, IP issues and security are major factors. Even though I’m not a cloud expert, I have read and heard of various cases where financial companies are using clouds. Whether their applications involve sensitive data I don’t know, but it seems that this is one area that is addressable (if not already addressed). As a side note, I was interested to see that Lilly seems to be making a move towards an Amazon-based cloud infrastructure.
But when I read Deepaks post, the question that occurred to me was: what is the compelling chemistry application that would really make use of the cloud?
While things like molecular dynamics are not going to run too well on a cloud setup, problems that are data parallel can make excellent use of such a setup. Given that, some immediate applications include docking, virtual screening and so on. There have been a number of papers talking about the use of Grids for docking, so one could easily consider docking in the cloud. Virtual screening (using docking, machine learning etc.) would be another application.
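The data-parallel structure of such screens is easy to sketch: split the ligand library into chunks, score each chunk independently, and merge the results. In the sketch below the per-ligand scoring function is a made-up placeholder (a real screen would call out to a docking or scoring engine), and the serial lapply stands in for whatever parallel apply the infrastructure provides.

```r
# Illustrative sketch of a data-parallel virtual screen. score.ligand is a
# dummy stand-in for a real docking/scoring engine.
score.ligand <- function(smiles) nchar(smiles)  # placeholder "score"

screen <- function(ligands, n.chunks = 4) {
  # assign each ligand to a chunk, preserving the original order
  chunks <- split(ligands, cut(seq_along(ligands), n.chunks, labels = FALSE))
  # lapply could be swapped for a parallel apply (e.g. via the snow package)
  # without changing the per-chunk logic - this is what makes the problem
  # embarrassingly parallel
  scores <- lapply(chunks, function(chunk) sapply(chunk, score.ligand))
  unlist(scores, use.names = FALSE)
}

lib <- c("CCO", "c1ccccc1", "CC(=O)O", "CCN")
screen(lib)  # one score per ligand, order preserved across chunks
```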
But the problem I see facing these efforts is that they tend to be project specific. In contrast, doing something like BLAST in the cloud is more standardized – you send in a sequence and compare it to the usual standard databases of sequences. On the other hand, each docking project is different in terms of the receptor (though there’s less variation there) and the ligand libraries. So on the chemistry side, the input is much larger and more variable.
Similarity searching is another example – one usually searches against a public database or a corporate collection. If these are not in the cloud, making use of the cloud is not very practical. Furthermore, how many different collections should be stored and accessed in the cloud?
Following on from this, one could ask, are chemistry datasets really that large? I’d say, no. But I qualify this statement by noting that many projects are quite specific – a single receptor of interest and some focused library. Even if that library is 2 or 3 million compounds, it’s still not very large. For example, while working on the Ugi project with Jean-Claude Bradley I had to dock 500,000 compounds. It took a few days to set up the conformers and then 1.5 days to do the docking, on 8 machines. With the conformers in hand, we can rapidly redock against other targets. But 8 machines is really small. Would I want to do this in the cloud? Sure, if it was set up for me. But I’d still have to transfer 80GB of data (though Amazon has this now). So the data is not big enough that I can’t handle it.
So this leads to the question: what is big enough to make use of the cloud?
What about really large structure databases? Say PubChem and ChemSpider? While Amazon has made progress in this direction by hosting PubChem, chemistry still faces the problem that PubChem is not the whole chemical universe. There will invariably be portions of chemical space that are not represented in a database. On the other hand a community oriented database like ChemSpider could take on this role – it already contains PubChem, so one could consider groups putting in their collections of interest (yes, IP is an issue but I can be hopeful!) and expanding the coverage of chemical space.
So to summarize, why isn’t there more chemistry in the cloud? Some possibilities include
- Chemistry projects tend to be specific, in the sense that there aren’t a whole lot of “standard” collections
- Large structure databases are not in the cloud and if they are, still do not cover the whole of chemical space
- Many chemistry problems are not large in terms of data size, compared to other life science applications
- Cheminformatics is a much smaller community than bioinformatics, though this applies mainly to non-corporate settings (where the reverse is likely true)
Though I haven’t explicitly talked about the tools, they are certainly a factor. While there are a number of Open Source solutions to various cheminformatics problems, many people use commercial tools and will want to use them in the cloud. So one factor that will need to be addressed is vendors coming on board and supporting cloud-style setups.
Recently Syed Asad Rahman of the EBI sent around a message regarding some new code that he has been developing to perform maximum common substructure detection. It employs the CDK but also includes some alternative methods, which allow it to handle some cases for which the CDK takes forever. An example is determining the MCSS between the two molecules below. The new code from Syed detects the MCSS (though it takes a few seconds) whereas the CDK does not return after 1 minute.
While the sources haven’t been released yet, he has made a JAR file available along with some example code. It’s still in beta, so there are some rough edges to the API as well as some edge cases that haven’t been caught. But overall, it seems to do a pretty good job on all the test cases I’ve thrown at it. To put it through its paces I put together a simple GUI to allow visual inspection of the resultant MCSSs. The download is large because I bundled the entire CDK with it. Also, the substructure rendering is a little crude, but it does the job.
A major component of QSAR modeling is the choice of molecular descriptors that are used in a model. The literature is replete with descriptors and there’s lots of software (commercial and open source) to calculate them. There are many issues related to molecular descriptors (such as many descriptors being correlated, and so on), but I came across a paper by Frank Burden and co-workers describing a “universal descriptor”. What is such a descriptor?
The idea derives from the fact that molecular descriptors usually characterize one specific structural feature. But in many cases, the biological activity of a molecule is a function of multiple structural features. This implies that you need multiple descriptors to capture the entire structure–activity relationship. The goal of a universal descriptor set is that it should be able to characterize a molecular structure in such a way that it (implicitly or explicitly) encodes all the structural features that might be relevant for the molecule’s activity in multiple, diverse scenarios. In other words, a true universal descriptor set could be used in a variety of QSAR models and not require additional descriptors.
One might ask whether this is feasible or not. But when we realize that in many cases biological activity is controlled by shape and electrostatics, it makes sense that a descriptor characterizing these two features simultaneously should be a good candidate. Burden et al. describe “charge fingerprints”, which are claimed to be a step towards such a universal descriptor set.
These descriptors are essentially binned counts of partial charges on specific atoms. The method considers 7 elements (H, C, N, O, P, S, Si) and for each element defines 3 bins. Then, for a given molecule, one simply bins the partial charges on the atoms. This results in a 21-element (7 elements × 3 bins) descriptor vector which can then be used in QSAR modeling. This is a very simple descriptor to implement (the authors’ implementation is commercially available, as far as I can see). They test it out on several large and diverse datasets and also compare these descriptors to atom count descriptors and BCUTs.
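The binning scheme is simple enough to sketch in a few lines of R. Note that the bin edges below are arbitrary placeholders, not the cutoffs from the paper, and the partial charges in the example are made up – in practice they would come from a real charge-assignment method.

```r
# Illustrative sketch of a charge fingerprint: count partial charges per
# element in 3 bins. Bin edges are arbitrary placeholders, not the
# published cutoffs.
elements <- c("H", "C", "N", "O", "P", "S", "Si")

charge.fingerprint <- function(symbols, charges, edges = c(-0.2, 0.2)) {
  desc <- numeric(3 * length(elements))
  names(desc) <- paste(rep(elements, each = 3), 1:3, sep = ".")
  for (i in seq_along(symbols)) {
    e <- match(symbols[i], elements)
    if (is.na(e)) next                          # skip elements outside the set
    bin <- findInterval(charges[i], edges) + 1  # charge falls in bin 1, 2 or 3
    desc[(e - 1) * 3 + bin] <- desc[(e - 1) * 3 + bin] + 1
  }
  desc
}

# toy input: atom symbols with made-up partial charges
syms <- c("C", "C", "O", "O", "H")
chrg <- c(-0.1, 0.5, -0.4, -0.5, 0.3)
fp <- charge.fingerprint(syms, chrg)
sum(fp)  # 5: every atom lands in exactly one bin
```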
The results indicate that while similar in performance to things like BCUTs, in the end combinations of these charge fingerprints with other descriptors perform best. OK, so that seems to preclude the charge fingerprints being universal in nature. The fact that the number of bins is an empirical choice based on the datasets they employed also seems like a factor that prevents them from being universal descriptors. And shape isn’t considered. Given this point, it would have been interesting to see how these descriptors compared to CPSAs. So while simple, interpretable and useful, it’s not clear why these would be considered universal.