So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘gis’ tag

ChEMBL in RDF and Other Musings

with one comment

Earlier today, Egon announced the release of an RDF version of ChEMBL, hosted at Uppsala. A nice feature of this setup is that one can play around with the data via SPARQL queries as well as explore the classes and properties that the Uppsala folks have implemented. Having fiddled with SPARQL on and off, it was nice to play with ChEMBL since it contains such a wide array of data types. For example,  find articles referring to an assay (or experiment) run in mice targeting isomerases:

1
2
3
4
5
6
7
8
9
10
PREFIX  chembl:  <http://rdf.farmbio.uu.se/chembl/onto/#>
SELECT DISTINCT ?x ?pmid ?pdesc ?DESC WHERE {
?protein chembl:hasKeyword "Isomerase" .
?x chembl:hasTarget ?protein .
?protein chembl:hasDescription ?pdesc .
?x chembl:organism  "Mus musculus" .
?x chembl:hasDescription ?DESC .
?x chembl:extractedFrom ?resource .
?resource <http://purl.org/ontology/bibo/pmid> ?pmid
}

I’ve been following the discussion on RDF and Semantic Web for some time. While I can see a number of benefits from this approach, I’ve never been fully convinced as to the utility. In too many cases, the use cases I’ve seen (such as the one above) could have been done relatively trivially via traditional SQL queries. There hasn’t been a really novel use case that leads to ‘Aha! So that’s what it’s good for’

Egons’ announcement today, led to a discussion on FriendFeed. I think I finally got the point that SPARQL queries are not magic and could indeed be replaced by traditional SQL. The primary value in RDF is the presence of linked data – which is slowly accumulating in the life sciences (cf. LODD and Bio2RDF).

Of the various features of RDF that I’ve heard about, the ability to define and use equivalence relationships seems very useful. I can see this being used to jump from domain to domain by recognizing properties that are equivalent across domains. Yet, as far as I can tell, this requires that somebody defines these equivalences manually. If we have to do that, one could argue that it’s not really different from defining a mapping table to link two RDBMS’s.

But I suppose in the end what I’d like to see is using all this RDF data to perform automated or semi-automated inferencing. In other words, what non-obvious relationships can be draw from a collection of facts and relationships? In absence of that, I am not necessarily pulling out a novel relationship (though I may be pulling out facts that I did not necessarily know) by constructing a SPARQL query. Is such inferencing even possible?

On those lines, I considered an interesting set of linked data – could we generate a geographically annotated version of PubMed. Essentially, identify a city and country for each PubMed ID. This could be converted to RDF and linked to other sources. One could start asking questions such as are people around me working on a certain topic? or what proteins are the focus of research in region X? Clearly, such a dataset does not require RDF per se. But given that geolocation data is qualitatively different from say UniProt ID’s and PubMed ID’s, it’d be interesting to see whether anything came of this. As a first step, here’s BioPython code to retrieve the Affiliation field from PubMed entries from 2009 and 2010.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from Bio import Entrez

startYear = 2009
endYear = 2010

Entrez.email = "some@email.id"
h = Entrez.esearch(db='pubmed', term='%d:%d[dp]' % (startYear,endYear), retmax=1000000)
records = Entrez.read(h)['IdList']
print 'Got %d records' % (len(records))
o = open('geo.txt', 'w')
for pmid in records:
    print 'Processing PMID %s' % (pmid)
    hf = Entrez.efetch(db='pubmed', id=pmid, retmode='xml', rettype='full')
    details = Entrez.read(hf)[0]
    try:
        aff = details['MedlineCitation']['Article']['Affiliation']
    except KeyError:
        print '%s had no affiliation' % (pmid)
        continue
    try:
        o.write('%s\t%s\n' % (pmid, aff.encode('latin-1')))
    except UnicodeEncodeError:
        'Cant encode for %s' % (pmid)
        continue
o.close()

Using data from the National Geospatial Agency, it shouldn’t be too difficult to link PubMed ID’s to geography.

Written by Rajarshi Guha

February 10th, 2010 at 4:40 am