So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘rdf’ tag

Wikipedia Category Hierarchy via N-triples

without comments

For a current project I needed to obtain a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the Mediawiki API. However, this was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-triple format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.

from ntriples import *
import sys

class Triple:
    """
     A simplistic representation of a triple
    """

    def __init__(self, s, p, o):
        self._s = s
        self._p = p
        self._o = o
    def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
    def subject(self): return self._s
    def predicate(self): return self._p
    def object(self): return self._o

class MySink:
    """
     This class stores the triples as they are parsed from the file
    """

    def __init__(self):
        self._triples = []

    def triple(self, s, p, o):
        self._triples.append( Triple(s,p,o) )
       
    def __len__(self): return len(self._triples)

    def getTriples(self): return self._triples

Loading in the triples is then as simple as

p = NTriplesParser(sink=MySink())
sink = p.parse(open(sys.argv[1]))
ts = sink.getTriples()

This results in a list of Triple objects. Before building the hierarchy, we remove triples that are not of interest (specifically, those with a predicate of “#type” or “#prefLabel”). This is relatively easy via filter:

ts = filter(lambda x: x.predicate().split("#")[1] not in ('type', "prefLabel"), ts)

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first level children. We then iterate over each of them and recursively identify their children.

protein_children = filter(lambda x: x.object().endswith("Category:Proteins"), ts)

def recurseChildren(query):
    c = filter(lambda x: x.object() == query.subject(), ts)
    if len(c) == 0: return []
    else:
        ret = []
        for i in c: ret.append( (i, recurseChildren(i)) )
        return ret

root = []
for child in protein_children:
    root.append( (child, recurseChildren(child)) )

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below. With this code in hand, I can now build the full hierarchy (using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia).
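As an aside, the indented listing below can be generated from the nested (triple, children) tuples with a simple recursive printer along the following lines; printTree and categoryName are illustrative helpers of my own, not part of the code above:

def categoryName(uri):
    # The child category is the subject of each "broader" triple; keep only
    # the part after "Category:" for display
    return uri.split("Category:")[-1]

def printTree(nodes, depth=1):
    # nodes is a list of (Triple, children) tuples as built by recurseChildren
    for triple, children in nodes:
        print '  ' * depth + categoryName(triple.subject())
        printTree(children, depth + 1)

printTree(root)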

  Enzymes
      Viral_enzymes
  Receptors
      Transmembrane_receptors
          7TM_receptors
              G_protein_coupled_receptors
          Tyrosine_kinase_receptors
          Ionotropic_receptors
      Sensory_receptors
          Photoreceptor_cells
      Intracellular_receptors
  Membrane_proteins
      Integral_membrane_proteins
      Peripheral_membrane_proteins
          G_proteins
          Lantibiotics
  Protein_structure
      Protein_structural_motifs
      Protein_domains
  Heat_shock_proteins
  Glycoproteins
  Serine_protease_inhibitors
  Prions
  Growth_factors
  Lipoproteins
  Cytokines
  Protein_images
  Metalloproteins
      Iron-sulfur_proteins
      Hemoproteins
  Cytoskeleton
      Motor_proteins
      Structural_proteins
          Keratins
  Motor_proteins
  Protein_methods
  Structural_proteins
      Keratins
  Protein_domains
  Cell_adhesion_proteins
  Clusters_of_differentiation

Written by Rajarshi Guha

January 2nd, 2011 at 7:28 am

Posted in software


ChEMBL in RDF and Other Musings

with one comment

Earlier today, Egon announced the release of an RDF version of ChEMBL, hosted at Uppsala. A nice feature of this setup is that one can play around with the data via SPARQL queries as well as explore the classes and properties that the Uppsala folks have implemented. Having fiddled with SPARQL on and off, it was nice to play with ChEMBL since it contains such a wide array of data types. For example, we can find articles referring to an assay (or experiment) run in mice targeting isomerases:

PREFIX  chembl:  <http://rdf.farmbio.uu.se/chembl/onto/#>
SELECT DISTINCT ?x ?pmid ?pdesc ?DESC WHERE {
?protein chembl:hasKeyword "Isomerase" .
?x chembl:hasTarget ?protein .
?protein chembl:hasDescription ?pdesc .
?x chembl:organism  "Mus musculus" .
?x chembl:hasDescription ?DESC .
?x chembl:extractedFrom ?resource .
?resource <http://purl.org/ontology/bibo/pmid> ?pmid
}
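If you’d rather run such a query from a script than from the web form, a sketch along these lines using the SPARQLWrapper Python library should work; note that the endpoint URL is my guess based on the ontology prefix above, so check the Uppsala pages for the actual SPARQL endpoint:

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX  chembl:  <http://rdf.farmbio.uu.se/chembl/onto/#>
SELECT DISTINCT ?x ?pmid ?pdesc ?DESC WHERE {
  ?protein chembl:hasKeyword "Isomerase" .
  ?x chembl:hasTarget ?protein .
  ?protein chembl:hasDescription ?pdesc .
  ?x chembl:organism  "Mus musculus" .
  ?x chembl:hasDescription ?DESC .
  ?x chembl:extractedFrom ?resource .
  ?resource <http://purl.org/ontology/bibo/pmid> ?pmid
}
"""

# The endpoint URL below is an assumption; consult the Uppsala site for the
# actual SPARQL endpoint address
sparql = SPARQLWrapper("http://rdf.farmbio.uu.se/chembl/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results['results']['bindings']:
    print row['pmid']['value'], row['DESC']['value']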

I’ve been following the discussion on RDF and the Semantic Web for some time. While I can see a number of benefits from this approach, I’ve never been fully convinced of its utility. In too many cases, the use cases I’ve seen (such as the one above) could have been handled relatively trivially via traditional SQL queries. There hasn’t been a really novel use case that leads to ‘Aha! So that’s what it’s good for.’

Egon’s announcement today led to a discussion on FriendFeed. I think I finally got the point that SPARQL queries are not magic and could indeed be replaced by traditional SQL. The primary value in RDF is the presence of linked data, which is slowly accumulating in the life sciences (cf. LODD and Bio2RDF).

Of the various features of RDF that I’ve heard about, the ability to define and use equivalence relationships seems very useful. I can see this being used to jump from domain to domain by recognizing properties that are equivalent across domains. Yet, as far as I can tell, this requires that somebody define these equivalences manually. If we have to do that, one could argue that it’s not really different from defining a mapping table to link two RDBMS’s.

But I suppose in the end what I’d like to see is all this RDF data being used to perform automated or semi-automated inferencing. In other words, what non-obvious relationships can be drawn from a collection of facts and relationships? In the absence of that, I am not necessarily pulling out a novel relationship (though I may be pulling out facts that I did not necessarily know) by constructing a SPARQL query. Is such inferencing even possible?

Along those lines, I considered an interesting set of linked data: could we generate a geographically annotated version of PubMed? Essentially, identify a city and country for each PubMed ID. This could be converted to RDF and linked to other sources. One could then start asking questions such as “are people around me working on a certain topic?” or “what proteins are the focus of research in region X?” Clearly, such a dataset does not require RDF per se. But given that geolocation data is qualitatively different from, say, UniProt ID’s and PubMed ID’s, it’d be interesting to see whether anything comes of this. As a first step, here’s BioPython code to retrieve the Affiliation field from PubMed entries from 2009 and 2010.

from Bio import Entrez

startYear = 2009
endYear = 2010

Entrez.email = "some@email.id"
h = Entrez.esearch(db='pubmed', term='%d:%d[dp]' % (startYear,endYear), retmax=1000000)
records = Entrez.read(h)['IdList']
print 'Got %d records' % (len(records))
o = open('geo.txt', 'w')
for pmid in records:
    print 'Processing PMID %s' % (pmid)
    hf = Entrez.efetch(db='pubmed', id=pmid, retmode='xml', rettype='full')
    details = Entrez.read(hf)[0]
    try:
        aff = details['MedlineCitation']['Article']['Affiliation']
    except KeyError:
        print '%s had no affiliation' % (pmid)
        continue
    try:
        o.write('%s\t%s\n' % (pmid, aff.encode('latin-1')))
    except UnicodeEncodeError:
        print 'Could not encode affiliation for %s' % (pmid)
        continue
o.close()

Using data from the National Geospatial-Intelligence Agency, it shouldn’t be too difficult to link PubMed ID’s to geography.
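To give a feel for that linking step, here is a rough sketch that matches gazetteer place names against the affiliation strings collected above; the gazetteer.txt file, its tab-separated (name, latitude, longitude) layout, and the naive substring matching are all assumptions for illustration, not a tested pipeline:

# Illustrative sketch: link PubMed IDs to coordinates by naive substring
# matching of gazetteer place names against the affiliations in geo.txt.
# The gazetteer.txt file and its (name, lat, lon) layout are assumed.
places = {}
for line in open('gazetteer.txt'):
    name, lat, lon = line.rstrip('\n').split('\t')
    places[name] = (float(lat), float(lon))

out = open('geo-linked.txt', 'w')
for line in open('geo.txt'):
    pmid, aff = line.rstrip('\n').split('\t', 1)
    for name, (lat, lon) in places.iteritems():
        if name in aff:
            out.write('%s\t%s\t%f\t%f\n' % (pmid, name, lat, lon))
            break
out.close()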

Written by Rajarshi Guha

February 10th, 2010 at 4:40 am