So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘go’ tag

PubChem Bioassay Annotation Poster

without comments

Some time back I described some work on the automated annotation of PubChem bioassays. The lack of annotations on the assays can make it difficult to integrate them with other biological resources. Ideally, the bioassays would be manually annotated; however, it’s not a very exciting job. So, in collaboration with Patrick Ruch and Julien Gobeill, we used their tool, GOCat, to automatically annotate the PubChem bioassay collection with GO terms. They recently presented a poster on this work at the 3rd International Biocuration Conference in Berlin.

Obviously, automated annotation will not be as good as expert, manual annotation. However, it does a decent job, and I think it’s in line with a recent post by Duncan Hull, where he quotes a paper from Google:

The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available

While we’re not using the PubChem assay data directly for learning, the automated approach to annotation means that we can move on to work that makes use of the annotations, rather than waiting on a full manual curation of the assay collection (which will likely supersede the automated annotations when it becomes available).

Written by Rajarshi Guha

April 21st, 2009 at 1:11 pm

Posted in cheminformatics,research


Getting the GO into a Graph Data Structure

with 3 comments

Today, while working on a project, I needed to get access to the Gene Ontology hierarchy. While there are a number of GO browsers such as AmiGO, I needed access to the raw data to generate a graph that I could then slice and dice. A few minutes with Python led to a simple solution.

The program parses the OBO 1.2 formatted GO data file (either by downloading it directly or by reading a local file) and outputs two things: a flat dictionary listing the term IDs, names, namespaces and so on, and a network representation of the GO hierarchy in ncol format. It uses a simple (and relatively non-robust) class to represent the data as an undirected graph, which is not really correct since the GO relationships are directed, though it’d be easy to use something like igraph to start doing some real network analysis. It’s certainly not a comprehensive solution, but I thought I’d put it out there.
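To give a flavour of the approach, here is a minimal sketch in the same spirit. It is not the program described above: the function names (`parse_obo`, `to_ncol`, `ancestors`) and the tiny hand-written `SAMPLE_OBO` text are assumptions made for illustration. It reads only the `id`, `name`, `namespace` and `is_a` tags from `[Term]` stanzas and ignores everything else in the OBO format (obsolete terms, `relationship` lines, escapes).

```python
from collections import defaultdict

# Hand-written sample in OBO 1.2 style, for illustration only;
# the real GO file is far larger and has many more tags per term.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0009987
name: cellular process
namespace: biological_process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0007049
name: cell cycle
namespace: biological_process
is_a: GO:0009987 ! cellular process

[Typedef]
id: part_of
name: part of
"""


def parse_obo(lines):
    """Return (terms, edges) from an iterable of OBO-formatted lines."""
    terms = {}         # term ID -> {"name": ..., "namespace": ...}
    edges = []         # (child ID, parent ID) pairs from is_a lines
    current_id = None  # ID of the [Term] stanza being read, if any
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
            continue
        if not in_term or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current_id = value
            terms[current_id] = {"name": "", "namespace": ""}
        elif current_id is None:
            continue
        elif tag in ("name", "namespace"):
            terms[current_id][tag] = value
        elif tag == "is_a":
            # Drop the trailing "! parent name" comment.
            edges.append((current_id, value.split("!")[0].strip()))
    return terms, edges


def to_ncol(edges):
    """Format the edges as an ncol edge list: one 'child parent' pair per line."""
    return "\n".join(f"{child} {parent}" for child, parent in edges)


def ancestors(term_id, edges):
    """Collect every term reachable from term_id by following is_a links."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    seen, stack = set(), [term_id]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


terms, edges = parse_obo(SAMPLE_OBO.splitlines())
print(to_ncol(edges))
```

Unlike the undirected representation mentioned above, this sketch keeps the child-to-parent direction of each `is_a` link, which is what makes a query like `ancestors("GO:0007049", edges)` meaningful. To run it on a local copy of the GO file, pass an open file object to `parse_obo` instead of the sample lines.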

Written by Rajarshi Guha

January 31st, 2009 at 1:34 am

Posted in software


Annotating Bioassays

with 2 comments

I’ve been working for some time with the PubChem Bioassay collection – a set of 1293 assays that cover a range of techniques (enzymatic, phenotypic etc.), targets and sizes (from 20 molecules to 200,000 molecules). In addition, some assays are primary, high-throughput assays whereas a number of them are smaller, confirmatory assays. While an extremely valuable collection, one of the drawbacks is the lack of curation. This has led to some people saying that the data is too noisy to be useful. Yes, the noise is a problem, but I think there’s still useful data to extract and model.

One of the problems I have faced is that while one can perform a full text search for assays on PubChem, there are no annotations on the assays themselves. One effect of this is that it is difficult to link an assay to other biological resources (though for enzymatic assays, one can determine a PubMed protein identifier). While working on my bioassay network project, I needed annotations and I didn’t want to create them manually.


Written by Rajarshi Guha

January 25th, 2009 at 5:03 pm