Wikipedia Category Hierarchy via N-triples

For a current project I needed to obtain a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the MediaWiki API. However, this was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-Triples format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.
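To make the format concrete, here is a minimal sketch of reading a single N-Triples line in which all three terms are URIs. The regular expression, the function name `parse_uri_triple`, and the example line are illustrative assumptions; this is not a full N-Triples parser (literals and blank nodes are ignored) and it is not the ntriples.py module.

```python
import re

# Matches "<subject> <predicate> <object> ." where all three terms are URIs.
# Literals and blank nodes are deliberately not handled in this sketch.
URI_TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def parse_uri_triple(line):
    """Return (subject, predicate, object) for a URI-only line, else None."""
    m = URI_TRIPLE.match(line)
    return m.groups() if m else None

# Illustrative line following the URI pattern of the SKOS categories file;
# it is not copied from the actual DBpedia download.
example = ('<http://dbpedia.org/resource/Category:Enzymes> '
           '<http://www.w3.org/2004/02/skos/core#broader> '
           '<http://dbpedia.org/resource/Category:Proteins> .')

child, relation, parent = parse_uri_triple(example)
```

Read this way, the subject is the narrower (child) category and the object is the broader (parent) category.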

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.

    from ntriples import *
    import sys

    class Triple:
        """
         A simplistic representation of a triple
        """
        def __init__(self, s, p, o):
            self._s = s
            self._p = p
            self._o = o
        def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
        def subject(self): return self._s
        def predicate(self): return self._p
        def object(self): return self._o

    class MySink:
        """
         This class stores the triples as they are parsed from the file
        """
        def __init__(self):
            self._triples = []

        def triple(self, s, p, o):
            self._triples.append( Triple(s,p,o) )

        def __len__(self): return len(self._triples)

        def getTriples(self): return self._triples

    p = NTriplesParser(sink=MySink())
    sink = p.parse(open(sys.argv[1]))
    ts = sink.getTriples()

This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically those with a predicate of “#type” or “#prefLabel”). This is relatively easy via filter:

    # list() keeps ts a real list, so it can be scanned repeatedly later on
    ts = list(filter(lambda x: x.predicate().split("#")[1] not in ('type', "prefLabel"), ts))
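The same filtering step can be tried without ntriples.py. The sketch below assumes triples are plain `(subject, predicate, object)` tuples of strings, and the helper `local_name` and the three sample triples are hypothetical. Taking the last piece of the split, rather than index 1, avoids an IndexError for a predicate that contains no "#".

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
UNWANTED = {"type", "prefLabel"}

def local_name(uri):
    """Return the part after the last '#', or the whole URI if there is none."""
    return uri.rsplit("#", 1)[-1]

# Hypothetical sample triples, written as plain tuples for illustration only.
enzymes = "http://dbpedia.org/resource/Category:Enzymes"
triples = [
    (enzymes, RDF + "type", SKOS + "Concept"),
    (enzymes, SKOS + "prefLabel", "Enzymes"),
    (enzymes, SKOS + "broader", "http://dbpedia.org/resource/Category:Proteins"),
]

# Keep only triples whose predicate is not one of the unwanted local names.
kept = [t for t in triples if local_name(t[1]) not in UNWANTED]
```

Only the "broader" triple survives in this toy input.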

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first level children. We then iterate over each of them and recursively identify their children.

    # First-level children: triples whose object is the Proteins category
    protein_children = list(filter(lambda x: x.object().endswith("Category:Proteins"), ts))

    def recurseChildren(query):
        # Children of a category are the triples that name it as their object
        c = list(filter(lambda x: x.object() == query.subject(), ts))
        if len(c) == 0: return []
        else:
            ret = []
            for i in c: ret.append( (i, recurseChildren(i)) )
            return ret

    root = []
    for child in protein_children:
        root.append( (child, recurseChildren(child)) )
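The recursion above rescans the whole triple list for every category, which may become slow on the full file. One possible alternative, sketched here under the assumption that triples are plain `(subject, predicate, object)` tuples of strings, is to index children by parent once and to skip any category already on the current path, in case the category graph contains a cycle. The names `build_children_index` and `subtree` and the toy data are hypothetical.

```python
from collections import defaultdict

BROADER = "http://www.w3.org/2004/02/skos/core#broader"

def build_children_index(triples):
    """Map each parent to the list of its direct children (one pass)."""
    children = defaultdict(list)
    for s, p, o in triples:
        if p == BROADER:
            children[o].append(s)  # s has o as its broader category
    return children

def subtree(node, children, path=()):
    """Return [(child, subtree_of_child), ...] below node.

    A child that already appears on the path from the root is skipped,
    so a cycle cannot cause unbounded recursion.
    """
    path = path + (node,)
    return [(c, subtree(c, children, path))
            for c in children.get(node, [])
            if c not in path]

# Toy data with short hypothetical names and a deliberate cycle (A -> B -> A).
toy = [
    ("B", BROADER, "A"),
    ("C", BROADER, "A"),
    ("D", BROADER, "B"),
    ("A", BROADER, "B"),
]
index = build_children_index(toy)
tree = subtree("A", index)
```

The result has the same nested `(node, children)` shape as `root`, except that each node is a plain string rather than a Triple.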

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below. With this code in hand, I can now build the full hierarchy (using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia).

    Enzymes
        Viral_enzymes
    Receptors
        Transmembrane_receptors
            7TM_receptors
                G_protein_coupled_receptors
            Tyrosine_kinase_receptors
            Ionotropic_receptors
        Sensory_receptors
            Photoreceptor_cells
        Intracellular_receptors
    Membrane_proteins
        Integral_membrane_proteins
        Peripheral_membrane_proteins
            G_proteins
            Lantibiotics
    Protein_structure
        Protein_structural_motifs
        Protein_domains
    Heat_shock_proteins
    Glycoproteins
    Serine_protease_inhibitors
    Prions
    Growth_factors
    Lipoproteins
    Cytokines
    Protein_images
    Metalloproteins
        Iron-sulfur_proteins
        Hemoproteins
    Cytoskeleton
        Motor_proteins
        Structural_proteins
            Keratins
    Motor_proteins
    Protein_methods
    Structural_proteins
        Keratins
    Protein_domains
    Cell_adhesion_proteins
    Clusters_of_differentiation
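The post does not include the code that printed this listing. A minimal sketch of one way to produce such an indented view is given below. It assumes the hierarchy is a nested list of `(uri, children)` pairs of plain strings (with the `root` structure above, the name would come from each Triple's `subject()` instead). The helpers `short_name` and `format_tree` are hypothetical, and the sample URIs are assumed to follow the same pattern as the Proteins category URI.

```python
def short_name(uri):
    """Drop everything up to and including 'Category:' from a category URI."""
    return uri.rsplit("Category:", 1)[-1]

def format_tree(nodes, depth=0, indent="    "):
    """Return one output line per category, indented by its depth."""
    lines = []
    for uri, children in nodes:
        lines.append(indent * depth + short_name(uri))
        lines.extend(format_tree(children, depth + 1, indent))
    return lines

# Small sample mirroring part of the listing above.
base = "http://dbpedia.org/resource/Category:"
sample = [
    (base + "Metalloproteins", [
        (base + "Iron-sulfur_proteins", []),
        (base + "Hemoproteins", []),
    ]),
    (base + "Prions", []),
]

lines = format_tree(sample)
print("\n".join(lines))
```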