Some time back I described some work on the automated annotation of PubChem bioassays. The lack of annotations on the assays can make it difficult to integrate them with other biological resources. Ideally, the bioassays would be manually annotated; however, it’s not a very exciting job. So, working with Patrick Ruch and Julien Gobeill, we used their tool, GOCat, to automatically annotate the PubChem bioassay collection with GO terms. They recently presented a poster on this work at the 3rd International Biocuration Conference in Berlin.
Obviously, automated annotation will not be as good as expert, manual annotation. However, it does a decent job, and I think it’s in line with a recent post by Duncan Hull, where he quotes a paper from Google:
“The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available.”
While we’re not using the PubChem assay data directly for learning, the automated approach means that we can move on to work that makes use of the annotations, rather than waiting on a full manual curation of the assay collection (which will likely supersede the automated annotations when it becomes available).