Wikipedia Category Hierarchy via N-triples

For a current project I needed a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the MediaWiki API, but this was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-Triples format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.

from ntriples import *
import sys

class Triple:
    """
    A simplistic representation of a triple
    """
    def __init__(self, s, p, o):
        self._s = s
        self._p = p
        self._o = o
    def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
    def subject(self): return self._s
    def predicate(self): return self._p
    def object(self): return self._o

class MySink:
    """
    This class stores the triples as they are parsed from the file
    """
    def __init__(self):
        self._triples = []

    def triple(self, s, p, o):
        self._triples.append( Triple(s,p,o) )

    def __len__(self): return len(self._triples)

    def getTriples(self): return self._triples

Loading in the triples is then as simple as

p = NTriplesParser(sink=MySink())
sink = p.parse(open(sys.argv[1]))
ts = sink.getTriples()
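
For readers without ntriples.py at hand, the idea can be sketched with the standard library alone. This is only a minimal illustration, not the parser used above: the regular expression, the function name parse_uri_triples and the sample line are assumptions made for the example, and it handles only lines whose three terms are all URIs (literals such as labels are silently skipped).

```python
import re

# Matches lines of the form: <subject> <predicate> <object> .
# Assumption: all three terms are URIs; literals and blank nodes do not match.
URI_TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_uri_triples(lines):
    """Yield (subject, predicate, object) tuples for URI-only triples."""
    for line in lines:
        m = URI_TRIPLE.match(line)
        if m:
            yield m.groups()

# Hypothetical sample input, written in the shape of a SKOS categories line.
sample = [
    '<http://dbpedia.org/resource/Category:Enzymes> '
    '<http://www.w3.org/2004/02/skos/core#broader> '
    '<http://dbpedia.org/resource/Category:Proteins> .',
    '# a comment line, ignored because it does not match',
]
triples = list(parse_uri_triples(sample))
```

A real N-Triples file also contains escaped characters and literals, so a proper parser is the safer choice for the full dump.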

This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically those with a predicate of “#type” or “#prefLabel”). This is relatively easy via a filter:

# A list comprehension keeps ts a list, so it can be scanned repeatedly later on
ts = [x for x in ts if x.predicate().split("#")[1] not in ('type', "prefLabel")]
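
One thing to watch: indexing split("#")[1] raises an IndexError for any predicate that contains no "#". A slightly more defensive variant might look like the sketch below. It assumes triples held as plain (subject, predicate, object) tuples of strings rather than the Triple class, and the helper names local_name and drop_uninteresting are made up for illustration.

```python
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def local_name(uri):
    """Return the text after the last '#', or '' when the URI has no fragment."""
    return uri.rsplit("#", 1)[1] if "#" in uri else ""

def drop_uninteresting(triples, skip=("type", "prefLabel")):
    """Keep (s, p, o) tuples whose predicate fragment is not in `skip`."""
    return [t for t in triples if local_name(t[1]) not in skip]

# Hypothetical sample triples, shortened for readability.
sample = [
    ("Category:Enzymes", SKOS_BROADER, "Category:Proteins"),
    ("Category:Enzymes", RDF_TYPE, "http://www.w3.org/2004/02/skos/core#Concept"),
    ("Category:Enzymes", "http://example.org/no-fragment", "x"),
]
kept = drop_uninteresting(sample)
```

Here the predicate without a fragment is kept rather than crashing the filter; whether that is the right policy depends on what else is in the file.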

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first level children. We then iterate over each of them and recursively identify their children.

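# First-level children: triples whose object is the Proteins category.
# The predicate is not checked here; this relies on the earlier filtering step.
protein_children = [x for x in ts if x.object().endswith("Category:Proteins")]

def recurseChildren(query):
    # Children of a category are the triples whose object is that category
    c = [x for x in ts if x.object() == query.subject()]
    if len(c) == 0: return []
    else:
        ret = []
        for i in c: ret.append( (i, recurseChildren(i)) )
        return ret

root = []
for child in protein_children:
    root.append( (child, recurseChildren(child)) )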
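
The recursion above rescans the whole triple list for every category, and it would never terminate if the category graph contained a cycle. A possible alternative, sketched here under the assumption that the triples have already been reduced to (child, parent) name pairs, is to index children by parent once and to skip any category already on the current path. The function build_tree and the sample pairs (including the artificial cycle) are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def build_tree(pairs, root):
    """Build nested (name, subtree) tuples from (child, parent) pairs."""
    # Index every child under its parent in a single pass.
    children = defaultdict(list)
    for child, parent in pairs:
        children[parent].append(child)

    def recurse(node, path):
        out = []
        for c in children.get(node, []):
            if c in path:
                # Cycle guard: this category is already an ancestor, skip it.
                continue
            out.append((c, recurse(c, path | {c})))
        return out

    return recurse(root, frozenset([root]))

# Hypothetical sample data; the last pair deliberately closes a cycle.
pairs = [
    ("Enzymes", "Proteins"),
    ("Viral_enzymes", "Enzymes"),
    ("Receptors", "Proteins"),
    ("Proteins", "Viral_enzymes"),
]
tree = build_tree(pairs, "Proteins")
```

The guard only tracks the current path, so a category reachable through two different parents still appears under both, as several do in the listing further down.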

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below. With this code in hand, I can now build the full hierarchy (using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia).

Enzymes
Viral_enzymes
Receptors
Transmembrane_receptors
7TM_receptors
G_protein_coupled_receptors
Tyrosine_kinase_receptors
Ionotropic_receptors
Sensory_receptors
Photoreceptor_cells
Intracellular_receptors
Membrane_proteins
Integral_membrane_proteins
Peripheral_membrane_proteins
G_proteins
Lantibiotics
Protein_structure
Protein_structural_motifs
Protein_domains
Heat_shock_proteins
Glycoproteins
Serine_protease_inhibitors
Prions
Growth_factors
Lipoproteins
Cytokines
Protein_images
Metalloproteins
Iron-sulfur_proteins
Hemoproteins
Cytoskeleton
Motor_proteins
Structural_proteins
Keratins
Motor_proteins
Protein_methods
Structural_proteins
Keratins
Protein_domains
Cell_adhesion_proteins
Clusters_of_differentiation
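
A listing of this kind can be produced from the nested (node, children) structure with a small recursive formatter. The sketch below assumes the category names have already been pulled out of the triples, so each node is a (name, subtree) tuple; format_tree and the two-level sample tree are hypothetical and not the code that generated the listing above.

```python
def format_tree(tree, depth=0, indent="  "):
    """Return one line per category, indented by its depth in the hierarchy."""
    lines = []
    for name, subtree in tree:
        lines.append(indent * depth + name)
        lines.extend(format_tree(subtree, depth + 1, indent))
    return lines

# Hypothetical sample tree in the (name, subtree) shape described above.
tree = [("Enzymes", [("Viral_enzymes", [])]), ("Receptors", [])]
listing = "\n".join(format_tree(tree))
print(listing)
```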
