So much to do, so little time

Trying to squeeze sense out of chemical data

Simple XML Parsing with Clojure

A while back I started playing with Clojure. It’s always been a spare-time hobby and, not having had much spare time, I haven’t gotten as far with it as I’d have liked. I’m still not sure why I like Clojure, but it is fun to code in. My interest was revitalized when I came across a Clojure group located in the D.C. area. So, following on my previous post on geo-referencing PubMed articles, I decided to take a stab at doing the whole thing in Clojure.

One of the tasks in this project is to query PubMed using the EUtils CGIs and parse out the information from the XML document that is returned. It turns out that parsing XML documents is very easy in Clojure. The parse function in the clojure.xml namespace reads an XML document (given as a file, an input stream or a string naming a URI) and returns a tree of tags. Calling xml-zip from the clojure.zip namespace on that tree creates a zipper data structure. Extracting specific elements is then achieved by filtering the zipper by the path to the desired element. It’s a lot like the ElementTree module in Python (but doesn’t require that I insert namespaces before each and every element in the path!). We start off by working in our own namespace and then importing the relevant packages:

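(ns entrez
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]
            ;; clojure.contrib was split up after Clojure 1.2; on later versions
            ;; these functions live in clojure.data.zip.xml (org.clojure/data.zip).
            [clojure.contrib.zip-filter.xml :as zf]))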
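To make the tree and the zipper a bit more concrete, here is a minimal sketch that parses a tiny, hand-written stand-in for an esearch response. The sample XML, the entrez.sketch namespace and the parse-str helper are my own illustrative assumptions, not part of EUtils or of the original code; only clojure.xml and clojure.zip are used.

```clojure
(ns entrez.sketch
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip])
  (:import (java.io ByteArrayInputStream)))

;; Hypothetical sample, loosely shaped like an esearch result.
(def sample
  (str "<eSearchResult><Count>2</Count>"
       "<IdList><Id>111</Id><Id>222</Id></IdList>"
       "</eSearchResult>"))

(defn parse-str
  "Hypothetical helper: xml/parse treats a plain string as a URI,
  so wrap the XML text in an input stream instead."
  [s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

;; The tree is a nested map with :tag, :attrs and :content keys.
(def tree (parse-str sample))

(:tag tree)                 ; => :eSearchResult
(map :tag (:content tree))  ; => (:Count :IdList)

;; The zipper lets us walk the tree one step at a time.
(def zipper (zip/xml-zip tree))

;; root -> Count -> IdList -> first Id -> its text
(-> zipper zip/down zip/right zip/down zip/down zip/node)  ; => "111"
```

Walking the zipper by hand like this gets tedious quickly, which is why the zip-filter functions that take a path of tags are so convenient.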

Next we define some helper functions:

(defn get-ids
  "Extract the Id values from the IdList of an esearch result"
  [zipper]
  (zf/xml-> zipper :IdList :Id zf/text))

(defn get-affiliations
  "Extract (PMID affiliation) pairs from PubMed abstracts"
  [zipper]
  ;; The two sequences are paired by position, so this assumes that every
  ;; article carries exactly one Affiliation element.
  (map (fn [x y] (list x y))
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :PMID zf/text)
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :Article :Affiliation zf/text)))
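One thing to watch with get-affiliations is that it pairs two independently extracted sequences by position, so an article without an Affiliation element would shift every later pair. A possible alternative, sketched below, is to visit each PubmedArticle in turn and pull both values from it with xml1->. This sketch assumes the org.clojure/data.zip library is on the classpath (it provides clojure.data.zip.xml, the successor of clojure.contrib.zip-filter.xml); the sample XML, the entrez.pairs namespace and the function name affiliations-by-article are hypothetical.

```clojure
(ns entrez.pairs
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]
            ;; Assumption: org.clojure/data.zip is available as a dependency.
            [clojure.data.zip.xml :as zf])
  (:import (java.io ByteArrayInputStream)))

;; Hypothetical sample, loosely shaped like an efetch result.
;; The second article deliberately has no Affiliation.
(def sample
  (str "<PubmedArticleSet>"
       "<PubmedArticle><MedlineCitation><PMID>1</PMID>"
       "<Article><Affiliation>Dept. A</Affiliation></Article>"
       "</MedlineCitation></PubmedArticle>"
       "<PubmedArticle><MedlineCitation><PMID>2</PMID>"
       "<Article></Article>"
       "</MedlineCitation></PubmedArticle>"
       "<PubmedArticle><MedlineCitation><PMID>3</PMID>"
       "<Article><Affiliation>Dept. C</Affiliation></Article>"
       "</MedlineCitation></PubmedArticle>"
       "</PubmedArticleSet>"))

(defn affiliations-by-article
  "Return one [pmid affiliation] pair per PubmedArticle.
  The affiliation is nil when the article has none."
  [zipper]
  (for [article (zf/xml-> zipper :PubmedArticle)]
    [(zf/xml1-> article :MedlineCitation :PMID zf/text)
     (zf/xml1-> article :MedlineCitation :Article :Affiliation zf/text)]))

(def zipper
  (zip/xml-zip
   (xml/parse (ByteArrayInputStream. (.getBytes sample "UTF-8")))))

(affiliations-by-article zipper)
;; => (["1" "Dept. A"] ["2" nil] ["3" "Dept. C"])
```

The trade-off is one extra zipper lookup per article, in exchange for a PMID that always stays attached to its own affiliation (or to nil).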

Finally, we can get the IDs from an esearch query by saving the results to a file and then running:

(println (get-ids
      (zip/xml-zip
       (xml/parse "esearch.xml"))))

or extract affiliations from a set of PubMed abstracts obtained via an efetch query:

(println (get-affiliations
      (zip/xml-zip
       (xml/parse "efetch.xml"))))

In the next post I’ll show some code to actually perform the queries via EUtils so that we don’t need to save results to files.

Written by Rajarshi Guha

February 17th, 2010 at 3:30 am

Posted in software,Uncategorized
