For a current project I needed to obtain a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the MediaWiki API. However, this was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-Triples format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.
I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.
    from ntriples import *
    import sys

    class Triple:
        """
        A simplistic representation of a triple
        """
        def __init__(self, s, p, o):
            self._s = s
            self._p = p
            self._o = o

        def __repr__(self):
            return '%s, %s, %s' % (self._s, self._p, self._o)

        def subject(self):
            return self._s

        def predicate(self):
            return self._p

        def object(self):
            return self._o

    class MySink:
        """
        This class stores the triples as they are parsed from the file
        """
        def __init__(self):
            self._triples = []

        def triple(self, s, p, o):
            self._triples.append( Triple(s,p,o) )

        def __len__(self):
            return len(self._triples)

        def getTriples(self):
            return self._triples
Loading the triples is then as simple as:
    p = NTriplesParser(sink=MySink())
    sink = p.parse(open(sys.argv[1]))
    ts = sink.getTriples()
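To make the file format concrete: each line of an N-Triples file holds a subject, a predicate and an object, terminated by a period. The following is only a minimal sketch of reading such a line, not how ntriples.py does it. It assumes all three terms are URIs (no literals or blank nodes), and the name `parse_uri_triple` and the sample line are made up for illustration.

```python
import re

# Matches a line whose subject, predicate and object are all <URI> terms.
# Assumption: literals and blank nodes are out of scope for this sketch.
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_uri_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    m = TRIPLE_RE.match(line)
    return m.groups() if m else None

# Illustrative line in the style of the SKOS categories file
line = ('<http://dbpedia.org/resource/Category:Enzymes> '
        '<http://www.w3.org/2004/02/skos/core#broader> '
        '<http://dbpedia.org/resource/Category:Proteins> .')
s, p, o = parse_uri_triple(line)
```

For real data, a full parser such as ntriples.py is the safer choice, since it also handles literals and escapes.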
This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically those with a predicate of “#type” or “#prefLabel”). This is relatively easy via filter:
    # list() keeps ts a list under Python 3, where filter returns an iterator
    ts = list(filter(lambda x: x.predicate().split("#")[1] not in ('type', "prefLabel"), ts))
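One caveat with splitting on "#" and taking index 1 is that a predicate URI without a fragment would raise an IndexError. The SKOS categories predicates mentioned here all contain a "#", so this may never matter in practice. Still, a slightly more defensive variant could look like the sketch below, where `local_name` and `is_interesting` are hypothetical helper names, not part of ntriples.py.

```python
def local_name(uri):
    """Return the part after the last '#', or the whole string if there is none."""
    return uri.rpartition('#')[2]

def is_interesting(predicate):
    """True unless the predicate is one of the relations we want to drop."""
    return local_name(predicate) not in ('type', 'prefLabel')
```

With the classes above, the filtering step would then read `ts = [t for t in ts if is_interesting(t.predicate())]`.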
With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first level children. We then iterate over each of them and recursively identify their children.
    # First-level children: triples whose object is the Proteins category
    protein_children = list(filter(lambda x: x.object().endswith("Category:Proteins"), ts))

    def recurseChildren(query):
        # Children of a category are the triples whose object is that category
        c = list(filter(lambda x: x.object() == query.subject(), ts))
        if len(c) == 0:
            return []
        else:
            ret = []
            for i in c:
                ret.append( (i, recurseChildren(i)) )
            return ret

    # The hierarchy is a nested list of (triple, children) pairs
    root = []
    for child in protein_children:
        root.append( (child, recurseChildren(child)) )
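The recursive approach scans the whole triple list once per category, and it would recurse forever if the category graph contained a cycle. A possible alternative, sketched here under the assumption that triples are plain `(subject, predicate, object)` string tuples rather than the Triple objects used above, is to index children by parent once and to track the current path during recursion. The names `build_index` and `subtree` are made up for this sketch, and it returns child URIs rather than triples.

```python
from collections import defaultdict

BROADER = "http://www.w3.org/2004/02/skos/core#broader"

def build_index(triples):
    """Map each parent category URI to the list of its direct children."""
    children = defaultdict(list)
    for s, p, o in triples:
        if p == BROADER:          # only follow the SKOS "broader" relation
            children[o].append(s)
    return children

def subtree(category, children, path=frozenset()):
    """Return [(child, subtree_of_child), ...], skipping any category
    that is already on the path from the root (guards against cycles)."""
    path = path | {category}
    return [(c, subtree(c, children, path))
            for c in children.get(category, [])
            if c not in path]
```

Building the index costs one pass over the triples, after which each lookup is a dictionary access instead of a full scan.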
Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below. With this code in hand, I can now build the full hierarchy (using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia).
    Enzymes
    Viral_enzymes
    Receptors
    Transmembrane_receptors
    7TM_receptors
    G_protein_coupled_receptors
    Tyrosine_kinase_receptors
    Ionotropic_receptors
    Sensory_receptors
    Photoreceptor_cells
    Intracellular_receptors
    Membrane_proteins
    Integral_membrane_proteins
    Peripheral_membrane_proteins
    G_proteins
    Lantibiotics
    Protein_structure
    Protein_structural_motifs
    Protein_domains
    Heat_shock_proteins
    Glycoproteins
    Serine_protease_inhibitors
    Prions
    Growth_factors
    Lipoproteins
    Cytokines
    Protein_images
    Metalloproteins
    Iron-sulfur_proteins
    Hemoproteins
    Cytoskeleton
    Motor_proteins
    Structural_proteins
    Keratins
    Motor_proteins
    Protein_methods
    Structural_proteins
    Keratins
    Protein_domains
    Cell_adhesion_proteins
    Clusters_of_differentiation
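A listing like the one above can be generated by walking the nested structure and indenting each name by its depth. The sketch below is not the code used for the post. It assumes the hierarchy is a nested list of `(uri, children)` pairs with URIs as plain strings, and `format_tree` and the example categories are hypothetical.

```python
def format_tree(nodes, depth=0, indent="  "):
    """Return one line per category, indented according to its depth."""
    lines = []
    for uri, kids in nodes:
        # Keep only the part after "Category:" for display
        name = uri.rsplit("Category:", 1)[-1]
        lines.append(indent * depth + name)
        lines.extend(format_tree(kids, depth + 1, indent))
    return lines

# Made-up hierarchy, purely for illustration
example = [
    ("http://dbpedia.org/resource/Category:Example_parent", [
        ("http://dbpedia.org/resource/Category:Example_child", []),
    ]),
    ("http://dbpedia.org/resource/Category:Example_sibling", []),
]
print("\n".join(format_tree(example)))
```

With the `(triple, children)` pairs built earlier, the only change needed would be to pass `triple.subject()` where this sketch uses `uri`.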