So much to do, so little time

Trying to squeeze sense out of chemical data

Wikipedia Category Hierarchy via N-triples

For a current project I needed a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the MediaWiki API, but that approach was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-Triples format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.
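To make the input format concrete, here is a minimal sketch of how one line of such a file splits into subject, predicate and object. The sample line is constructed from the URIs mentioned in this post rather than copied from the download, and the regular expression and the `split_uri_triple` helper are illustrative assumptions that only handle lines where all three terms are URIs (so label triples with literal objects are not covered). The actual parsing in this post is done with ntriples.py.

```python
import re

# Hypothetical sample line, built from the URIs mentioned in this post.
SAMPLE = ('<http://dbpedia.org/resource/Category:Enzymes> '
          '<http://www.w3.org/2004/02/skos/core#broader> '
          '<http://dbpedia.org/resource/Category:Proteins> .')

# Matches a line whose subject, predicate and object are all URIs.
URI_TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def split_uri_triple(line):
    """Return (subject, predicate, object), or None if the line is not a URI-only triple."""
    m = URI_TRIPLE.match(line)
    return m.groups() if m else None

s, p, o = split_uri_triple(SAMPLE)
```

Read as a statement, the sample says that Category:Enzymes has Category:Proteins as a broader category, i.e. Enzymes is a child of Proteins.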

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.

import sys

from ntriples import NTriplesParser

class Triple:
    """
    A simplistic representation of a triple
    """

    def __init__(self, s, p, o):
        self._s = s
        self._p = p
        self._o = o

    def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
    def subject(self): return self._s
    def predicate(self): return self._p
    def object(self): return self._o

class MySink:
    """
    This class stores the triples as they are parsed from the file
    """

    def __init__(self):
        self._triples = []

    # Called by the parser once for every triple it reads
    def triple(self, s, p, o):
        self._triples.append(Triple(s, p, o))

    def __len__(self): return len(self._triples)

    def getTriples(self): return self._triples

Loading in the triples is then as simple as:

p = NTriplesParser(sink=MySink())
sink = p.parse(open(sys.argv[1]))
ts = sink.getTriples()

This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically those with a predicate of “#type” or “#prefLabel”). This is relatively easy via filter:

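# list(...) keeps ts reusable under Python 3; [-1] avoids an IndexError for predicates without a "#"
ts = list(filter(lambda x: x.predicate().split("#")[-1] not in ("type", "prefLabel"), ts))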
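The filtering step above relies on every predicate URI containing a “#”. As a hedged sketch of a slightly more defensive variant, the helper below (the name `local_name` and the `example.org` URI are hypothetical, not part of ntriples.py or DBpedia) falls back to the last path segment when there is no fragment. Plain tuples stand in for the Triple objects so that the snippet runs on its own.

```python
def local_name(uri):
    """Return the part after '#', or the last path segment if there is no '#'."""
    if '#' in uri:
        return uri.rsplit('#', 1)[1]
    return uri.rstrip('/').rsplit('/', 1)[-1]

# Predicates we do not need when building the hierarchy.
SKIP = ('type', 'prefLabel')

# Toy (subject, predicate, object) tuples; the example.org URI is made up.
toy = [
    ('a', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'x'),
    ('a', 'http://www.w3.org/2004/02/skos/core#prefLabel', 'A'),
    ('a', 'http://www.w3.org/2004/02/skos/core#broader', 'b'),
    ('a', 'http://example.org/vocab/related', 'c'),
]

kept = [t for t in toy if local_name(t[1]) not in SKIP]
```

Only the “broader” triple and the fragment-less one survive the filter here.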

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first level children. We then iterate over each of them and recursively identify their children.

# First-level children: triples whose object is the Proteins category
protein_children = list(filter(lambda x: x.object().endswith("Category:Proteins"), ts))

def recurseChildren(query):
    # The children of `query` are the triples whose object is the subject of `query`
    c = list(filter(lambda x: x.object() == query.subject(), ts))
    return [(i, recurseChildren(i)) for i in c]

# Nested list of (triple, children) pairs rooted at Category:Proteins
root = [(child, recurseChildren(child)) for child in protein_children]
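The recursion above rescans the whole triple list for every category, and it would never terminate if the category graph contained a cycle. A possible alternative, sketched here under my own assumptions rather than as the method used in this post, is to index the “broader” triples by parent once and to skip any category already on the current path. The names `build_index` and `subtree` are hypothetical, and the sketch works on plain (subject, predicate, object) tuples and returns category URIs instead of Triple objects.

```python
from collections import defaultdict

BROADER = 'http://www.w3.org/2004/02/skos/core#broader'

def build_index(triples):
    """Map each parent category URI to the list of its direct child URIs."""
    children = defaultdict(list)
    for s, p, o in triples:
        if p == BROADER:
            children[o].append(s)
    return children

def subtree(node, children, path=()):
    """Return [(child, subtree_of_child), ...] for `node`.

    `path` holds the ancestors of `node`; a child that is `node` itself or
    one of its ancestors is skipped so that cycles cannot recurse forever.
    """
    result = []
    for child in children.get(node, []):
        if child == node or child in path:
            continue  # cycle guard
        result.append((child, subtree(child, children, path + (node,))))
    return result

# Toy data with a deliberate cycle: Proteins -> Enzymes -> Viral -> Proteins
toy = [
    ('Enzymes', BROADER, 'Proteins'),
    ('Viral', BROADER, 'Enzymes'),
    ('Proteins', BROADER, 'Viral'),
]
tree = subtree('Proteins', build_index(toy))
```

The index is built in a single pass, so each lookup of a category's children no longer depends on the total number of triples, which should matter once the full file is loaded.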

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below. With this code in hand, I can now build the full hierarchy (using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia).

  Enzymes
      Viral_enzymes
  Receptors
      Transmembrane_receptors
          7TM_receptors
              G_protein_coupled_receptors
          Tyrosine_kinase_receptors
          Ionotropic_receptors
      Sensory_receptors
          Photoreceptor_cells
      Intracellular_receptors
  Membrane_proteins
      Integral_membrane_proteins
      Peripheral_membrane_proteins
          G_proteins
          Lantibiotics
  Protein_structure
      Protein_structural_motifs
      Protein_domains
  Heat_shock_proteins
  Glycoproteins
  Serine_protease_inhibitors
  Prions
  Growth_factors
  Lipoproteins
  Cytokines
  Protein_images
  Metalloproteins
      Iron-sulfur_proteins
      Hemoproteins
  Cytoskeleton
      Motor_proteins
      Structural_proteins
          Keratins
  Motor_proteins
  Protein_methods
  Structural_proteins
      Keratins
  Protein_domains
  Cell_adhesion_proteins
  Clusters_of_differentiation
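The post does not show how the listing above was printed, so the following is only a sketch of one way to turn a nested list of (node, children) pairs into indented lines. The function name `format_tree` and its `label` parameter are assumptions for illustration; `label` turns a node into a URI string, and everything up to “Category:” is stripped for display.

```python
def format_tree(nodes, depth=0, indent=4, label=lambda n: n):
    """Return one indented line per category in a nested (node, children) list."""
    lines = []
    for node, kids in nodes:
        # Keep only the part after "Category:" for display
        name = label(node).rsplit('Category:', 1)[-1]
        lines.append(' ' * (indent * depth) + name)
        lines.extend(format_tree(kids, depth + 1, indent, label))
    return lines

# Toy tree using category names that appear in the listing above
toy_tree = [
    ('http://dbpedia.org/resource/Category:Enzymes',
     [('http://dbpedia.org/resource/Category:Viral_enzymes', [])]),
    ('http://dbpedia.org/resource/Category:Prions', []),
]
lines = format_tree(toy_tree)
```

With the `root` structure built earlier, where each node is a Triple, something like `print("\n".join(format_tree(root, label=lambda t: t.subject())))` should give a comparable listing.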

Written by Rajarshi Guha

January 2nd, 2011 at 7:28 am

Posted in software
