So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘pubmed’ tag

Author Count Frequencies in PubMed


Earlier today, Emily Wixson posted a question on the CHMINF-L list asking

… if there is any way to count the number of authors of papers with specific keywords in the title by year over a decade …

Since I had some code compiling and databases loading I took a quick stab, using Python and the Entrez services. The query provided by Emily was

(RNA[Title] OR "ribonucleic acid"[Title]) AND ("2009"[Publication Date] : "2009"[Publication Date])

The Python code to retrieve all the relevant PubMed IDs and then process the PubMed entries to extract the article ID, year and number of authors is below. For some reason this query also retrieves articles from before 2001 and articles with no year or zero authors, but we can easily filter those entries out.

import urllib
import xml.etree.ElementTree as ET

def chunker(seq, size):
    # yield successive chunks of at most 'size' IDs
    return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

query = '(RNA[Title] OR "ribonucleic acid"[Title]) AND ("2009"[Publication Date] : "2009"[Publication Date])'

# grab all matching PubMed IDs in a single esearch call (URL-encoding the query)
esearch = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=2001&maxdate=2010&retmode=xml&retmax=10000000&term=%s' % (urllib.quote_plus(query))
handle = urllib.urlopen(esearch)
data = handle.read()

root = ET.fromstring(data)
ids = [x.text for x in root.findall("IdList/Id")]
print 'Got %d articles' % (len(ids))

# fetch the entries 100 at a time and pull out the ID, year and author count
for group in chunker(ids, 100):
    efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=%s" % (','.join(group))
    handle = urllib.urlopen(efetch)
    data = handle.read()

    root = ET.fromstring(data)
    for article in root.findall("PubmedArticle"):
        pmid = article.find("MedlineCitation/PMID").text
        year = article.find("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
        if year is None: year = 'NA'
        else: year = year.text
        aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
        print pmid, year, len(aulist)
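The filtering mentioned above can be sketched as a small post-processing step over the script's output. This is a Python 3 sketch with made-up sample records; the input format is assumed to match the `pmid year count` lines the script prints.

```python
def filter_records(lines):
    # keep only records from 2001-2010 that have a year and at least one author
    kept = []
    for line in lines:
        pmid, year, nauthors = line.split()
        if year == 'NA' or int(nauthors) == 0:
            continue
        if not (2001 <= int(year) <= 2010):
            continue
        kept.append((pmid, int(year), int(nauthors)))
    return kept

records = filter_records([
    "19880000 2009 4",
    "10000001 1997 2",   # outside the date window, dropped
    "19880001 NA 3",     # no year, dropped
    "19880002 2009 0",   # no authors, dropped
])
print(records)  # [('19880000', 2009, 4)]
```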

With IDs, years and author counts in hand, a bit of R lets us visualize the distribution of author counts, over the whole decade and also by individual years.

The median author count is 4. There are a number of papers with more than 15 authors, and a few papers with more than 35 authors on them. If we exclude papers with, say, more than 20 authors and view the distribution by year, we get the following set of histograms.

We can see that over the years the median number of authors per paper is more or less constant at 4, rising to 5 in 2009 and 2010. At the same time, the distribution grows broader over the years, indicating an increasing number of papers with larger author counts.
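The original summary was done in R, but the same numbers can be reproduced in plain Python. A minimal sketch, using made-up (year, author count) pairs in place of the real data:

```python
from statistics import median
from collections import defaultdict

# (year, author count) pairs; these values are invented for illustration
data = [(2008, 3), (2008, 4), (2008, 5),
        (2009, 4), (2009, 5), (2009, 7),
        (2010, 5), (2010, 5), (2010, 21)]

# overall median author count
print(median(n for (_, n) in data))

# per-year medians, excluding papers with more than 20 authors
by_year = defaultdict(list)
for year, n in data:
    if n <= 20:
        by_year[year].append(n)
for year in sorted(by_year):
    print(year, median(by_year[year]))
```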

Anyway, this was a quick hack, and there are probably more rigorous ways to do this (such as using Web of Science – but automating that would be painful).

Written by Rajarshi Guha

September 15th, 2010 at 2:19 am

Posted in software


Simple XML Parsing with Clojure


A while back I had started playing with Clojure. It’s always been a spare-time hobby and not having had much spare time I haven’t really gotten as far ahead with it as I’d have liked. I’m still not sure why I like Clojure, but it is fun to code in. My interest was revitalized when I came across a Clojure group located in the D.C. area. So following on my previous post on geo-referencing PubMed articles, I decided to take a stab at doing the whole thing in Clojure.

One of the tasks in this project is to query PubMed using the EUtils CGIs and parse out the information from the XML document that is returned. It turns out that parsing XML documents or strings is very easy in Clojure. The parse method in the clojure.xml namespace supports parsing of XML documents, returning a tree of tags. Using xml-zip from the clojure.zip namespace creates a zipper data structure from the tree. Extracting specific elements is achieved by filtering the zipper by the path to the desired element. It’s a lot like the ElementTree module in Python (but doesn’t require that I insert namespaces before each and every element in the path!). We start off by working in our own namespace and then importing the relevant packages

(ns entrez
  (:require [clojure.xml :as xml])
  (:require [clojure.zip :as zip])
  (:require [clojure.contrib.zip-filter.xml :as zf]))

Next we define some helper methods

(defn get-ids
  "Extract PubMed IDs from an esearch result"
  [zipper]
  (zf/xml-> zipper :IdList :Id zf/text))

(defn get-affiliations
  "Extract (PMID, affiliation) pairs from PubMed abstracts"
  [zipper]
  (map (fn [x y] (list x y))
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :PMID zf/text)
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :Article :Affiliation zf/text)))

Finally, we can get the IDs from an esearch query by saving the results to a file and then running

(println (get-ids
      (zip/xml-zip
       (xml/parse "esearch.xml"))))

or extract affiliations from a set of PubMed abstracts obtained via an efetch query

(println (get-affiliations
      (zip/xml-zip
       (xml/parse "efetch.xml"))))

In the next post I’ll show some code to actually perform the queries via EUtils so that we don’t need to save results to files.

Written by Rajarshi Guha

February 17th, 2010 at 3:30 am

Posted in software,Uncategorized


Frequency of a Term via PubMed


A little while back, Egon posted a question on FriendFeed, asking whether there was an easy way, preferably a service, to determine and plot the usage count of a term in PubMed by year. This is simple enough using the Entrez Utilities CGI. A quick Python script to do this (with minimal error checking) is given below. It’d be relatively trivial to wrap this as a mod_python application and generate a bar plot directly (either using Python or using one of the online charting APIs).

import urllib
import xml.etree.ElementTree as ET

# esearch against PubMed, restricted to a single year per request
u = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%s&mindate=%d/01/01&maxdate=%d/12/31"
term = "artemisinin resistance"
startYear = 1998
endYear = 2009
for year in range(startYear, endYear+1):
    url = u % (urllib.quote_plus(term), year, year)
    page = urllib.urlopen(url).read()
    doc = ET.XML(page)
    # the <Count> element of the esearch result holds the number of hits
    count = doc.find("Count").text
    print year, count

Update 1

A little more hacking and the above code was converted to a mod_python application, which can be accessed using a URL of the form http://rest.rguha.net/usage/usage.py?term=TERM&syear=1997&eyear=2009. With the help of the handy pygooglechart module, the above URL returns an <img> tag containing the appropriate Google Charts URL. As an example, the term “artemisinin resistance” results in this image.
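The chart URL that pygooglechart builds can also be assembled by hand. Here is a minimal sketch using the Google Image Charts text encoding (that service has since been retired, so the URL is illustrative only), with made-up counts:

```python
counts = {2007: 1, 2008: 3, 2009: 5}  # invented (year, count) data

years = sorted(counts)
values = [counts[y] for y in years]

# the 't:' text encoding expects values scaled into the 0-100 range
top = max(values)
scaled = [100.0 * v / top for v in values]

# cht=bvs is a stacked vertical bar chart; chxl labels the x axis with years
url = ("http://chart.apis.google.com/chart?cht=bvs&chs=400x200"
       "&chd=t:" + ",".join("%.1f" % v for v in scaled) +
       "&chxt=x&chxl=0:|" + "|".join(str(y) for y in years))
print(url)
```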

Update 2

Jan Schoones pointed out in a comment that my artemisinin resistance example was slightly incorrect, as the resultant PubMed search does not search for the exact phrase, but rather, looks for documents that contain the words “artemisinin” and “resistance”. This is because the example URL does not include the quotes around the phrase. A more correct example would be here, where we search for the phrase, rather than individual words.
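When building such a URL by hand, the quotes around the phrase (and the spaces) need to be percent-encoded. With Python 3's urllib.parse, the difference between the exact-phrase search and the loose word search looks like this:

```python
from urllib.parse import quote_plus

# quoted phrase: PubMed searches for the exact phrase
exact = quote_plus('"artemisinin resistance"')
# unquoted: PubMed matches documents containing both words anywhere
loose = quote_plus('artemisinin resistance')

print(exact)  # %22artemisinin+resistance%22
print(loose)  # artemisinin+resistance
```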

Written by Rajarshi Guha

November 10th, 2009 at 11:50 pm

Posted in software
