So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘bibliometrics’ tag

Author Count Frequencies in PubMed


Earlier today, Emily Wixson posted a question on the CHMINF-L list asking

… if there is any way to count the number of authors of papers with specific keywords in the title by year over a decade …

Since I had some code compiling and databases loading I took a quick stab, using Python and the Entrez services. The query provided by Emily was

(RNA[Title] OR "ribonucleic acid"[Title]) AND ("2009"[Publication Date] : "2009"[Publication Date])

The Python code to retrieve all the relevant PubMed IDs and then process the PubMed entries to extract the article ID, year and number of authors is below. For some reason this query also retrieves articles from before 2001 and articles with no year or zero authors, but we can easily filter those entries out.

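import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# NOTE: the two URLs were lost from the original listing. The ones below are
# the standard NCBI E-utilities endpoints; the db, retmax and retmode
# parameters are assumptions, not necessarily what the original script used.
ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10000&term=%s'
EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=%s'

def chunker(seq, size):
    # Yield successive slices of seq, each at most `size` items long
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

query = '(RNA[Title] OR "ribonucleic acid"[Title]) AND ("2009"[Publication Date] : "2009"[Publication Date])'

# Step 1: run the search and collect the matching PubMed IDs
esearch = ESEARCH % urllib.parse.quote_plus(query)
handle = urllib.request.urlopen(esearch)
data = handle.read()

root = ET.fromstring(data)
ids = [x.text for x in root.findall("IdList/Id")]
print('Got %d articles' % len(ids))

# Step 2: fetch the records 100 IDs at a time and print ID, year, author count
for group in chunker(ids, 100):
    efetch = EFETCH % ','.join(group)
    handle = urllib.request.urlopen(efetch)
    data = handle.read()

    root = ET.fromstring(data)
    for article in root.findall("PubmedArticle"):
        pmid = article.find("MedlineCitation/PMID").text
        year = article.find("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
        # Some records carry no <Year> element; mark them so they can be filtered later
        year = 'NA' if year is None else year.text
        aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
        print(pmid, year, len(aulist))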
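
The filtering mentioned above can be done in a few lines. This is only a minimal sketch, assuming the script's output was saved as whitespace-separated "pmid year count" lines; the function names and the sample lines are made up for illustration.

```python
def parse_records(lines):
    """Parse 'pmid year n_authors' lines into (pmid, year, n_authors) tuples.

    year is an int, or None when the line carries 'NA' instead of a year.
    Lines that do not fit the pattern (e.g. 'Got 12 articles') are skipped.
    """
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        pmid, year, count = parts
        try:
            records.append((pmid, None if year == 'NA' else int(year), int(count)))
        except ValueError:
            continue  # not a data line
    return records


def keep_valid(records, first_year=2001):
    """Drop records with no year, a year before first_year, or zero authors."""
    return [r for r in records
            if r[1] is not None and r[1] >= first_year and r[2] > 0]


# Made-up sample lines in the format printed by the script
sample = ["111 2009 4", "222 NA 3", "333 1998 2", "444 2005 0", "555 2010 12"]
clean = keep_valid(parse_records(sample))
print(clean)
```

The 2001 cutoff is taken from the text; change first_year if a different range is of interest.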

With IDs, years and author counts in hand, a bit of R lets us visualize the distribution of author counts, over the whole decade and also by individual years.

The median author count is 4. There are a number of papers that have more than 15 authors, and a few individual papers with more than 35. If we exclude papers with, say, more than 20 authors and view the distribution by year, we get the following set of histograms.
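
The plots were made in R, which is not shown here. As a rough Python equivalent, the sketch below computes the median and a text histogram with a cutoff on author count. The counts are made-up values, not the PubMed data, and author_histogram is a hypothetical helper.

```python
from collections import Counter
from statistics import median


def author_histogram(counts, max_authors=20):
    """Return {author_count: n_papers}, ignoring papers above max_authors."""
    kept = Counter(c for c in counts if c <= max_authors)
    return dict(sorted(kept.items()))


# Made-up author counts, one per paper
counts = [1, 2, 3, 4, 4, 5, 6, 17, 36]

print('median:', median(counts))
for n_authors, n_papers in author_histogram(counts).items():
    # One '#' per paper with this many authors
    print('%2d %s' % (n_authors, '#' * n_papers))
```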

We can see that over the years the median number of authors on a paper is more or less constant at 4, increasing to 5 in 2009 and 2010. At the same time, the distribution grows broader over the years, indicating that there is an increasing number of papers with larger author counts.
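
One way to put numbers on both observations is to compute, per year, the median and the share of papers above some author count. The sketch below is an assumption about how one might do this, not the R code used for the plots; the threshold of 10 authors and the sample records are arbitrary.

```python
from collections import defaultdict
from statistics import median


def summarize_by_year(records, large=10):
    """Map year -> (median author count, share of papers with > large authors)."""
    by_year = defaultdict(list)
    for year, n_authors in records:
        by_year[year].append(n_authors)
    summary = {}
    for year in sorted(by_year):
        counts = by_year[year]
        share = sum(1 for c in counts if c > large) / len(counts)
        summary[year] = (median(counts), share)
    return summary


# Made-up (year, author count) pairs
records = [(2001, 3), (2001, 4), (2001, 5), (2001, 4),
           (2010, 2), (2010, 5), (2010, 12), (2010, 5)]
summary = summarize_by_year(records)
for year, (med, share) in summary.items():
    print(year, med, share)
```

A rising share with a flat median is what a broadening right tail would look like in this summary.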

Anyway, this was a quick hack, and there are probably more rigorous ways to do this (such as using Web of Science – but automating that would be painful).

Written by Rajarshi Guha

September 15th, 2010 at 2:19 am

Posted in software


Drug Discovery Trends In and From the Literature


I came across a recent paper by Agarwal and Searls which describes a detailed bibliometric analysis of the scientific literature to identify and characterize specific research topics that appear to be drivers for drug discovery – i.e., research areas/topics in life sciences that exhibit significant activity and thus might be fruitful for drug discovery efforts. The analysis is based on PubMed abstracts, citation data and patents, and addresses research topics ranging from very broad WHO disease categories (such as respiratory infections and skin diseases) to more specific topics via MeSH headings and down to the level of individual genes and pathways. The details of the methodology are described in a separate paper, but this one provides some very interesting analyses and visualizations of drug discovery trends.

The authors start out by defining the idea of “push” and “pull”:

Unmet medical need and commercial potential, which are considered the traditional drivers of drug discovery, may be seen as providing ‘pull’ for pharmaceutical discovery efforts. Yet, if a disease area offers no ‘push’ in the form of new scientific opportunities, no amount of pull will lead to new drugs — at least not mechanistically novel ones …

The authors then describe how they characterized “push” and “pull”. The key premise of the paper is that publication rates are an overall measure of activity and, when categorized by topic (disease area, target, pathway etc), represent the activity in that area. Of course, there are many factors that determine why certain therapeutic or target areas receive more interest than others, and the authors clearly state that even their concepts of “push” and “pull” likely overlap and thus are not independent.

For now I’ll just highlight some of their analysis from the “pull” category. For example, using therapeutic areas from a WHO list (that characterized disease burden) and PubMed searches to count the number of papers in a given area, they generated the plot below. The focus on global disease burden and developed world disease burden was based on the assumption that the former measures general medical need and the latter measures commercial interest.
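
Counting papers per area with PubMed searches can be sketched with the same Entrez service used earlier on this blog. This is an illustration only, not the authors' method: the search terms are hypothetical, and the sample response is a trimmed, made-up stand-in for what ESearch returns when asked for rettype=count.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def count_url(term):
    """Build an ESearch URL that asks only for the number of matching records."""
    params = {'db': 'pubmed', 'term': term, 'rettype': 'count'}
    return ESEARCH + '?' + urllib.parse.urlencode(params)


def parse_count(xml_text):
    """Extract the hit count from an ESearch XML response."""
    return int(ET.fromstring(xml_text).findtext('Count'))


# Hypothetical search terms for two disease areas
areas = {'malaria': 'malaria[Title]', 'asthma': 'asthma[Title]'}
for name, term in areas.items():
    print(name, count_url(term))

# Made-up response body, used here instead of a live request
sample_response = '<eSearchResult><Count>1234</Count></eSearchResult>'
print(parse_count(sample_response))
```

In practice each URL would be fetched and the body passed to parse_count; how a WHO category maps to a search term is the hard part and is not addressed here.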

Figure 2 from Agarwal and Searls


Another interesting summary was the rate of change of publication in a given area (this time defined by MeSH heading). This analysis used a 3 year window over a 30 year period and highlighted some interesting short term trends – such as the spurt in publications for viral research around the early 1980s, which likely corresponded to the discovery of retroviral involvement in human disease, followed later by the AIDS epidemic. The data is summarized in the figure below, where red corresponds to a spurt in activity and blue to a stagnation in publication activity.

Figure 4 from Agarwal and Searls
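
To make the windowed rate of change concrete, here is one possible definition: compare the publications in a 3 year window with those in the 3 years before it. The paper's exact formula is not reproduced in this post, so this is an assumption, and the yearly counts are made up.

```python
def window_total(counts, end_year, width):
    """Sum yearly counts over the `width` years ending at end_year (inclusive)."""
    return sum(counts.get(y, 0) for y in range(end_year - width + 1, end_year + 1))


def window_growth(counts, end_year, width=3):
    """Relative change between the window ending at end_year and the one before it.

    Returns None when the earlier window has no publications.
    """
    current = window_total(counts, end_year, width)
    previous = window_total(counts, end_year - width, width)
    if previous == 0:
        return None
    return (current - previous) / previous


# Made-up publication counts per year for one MeSH heading
counts = {1978: 100, 1979: 100, 1980: 100, 1981: 150, 1982: 200, 1983: 250}
growth = window_growth(counts, 1983)
print(growth)
```

Positive values would correspond to the spurts described above and values near zero or below to stagnation.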

They also report analyses that focus on high impact papers, based on a few of the top tier journals (as identified by impact factor) – while their justification is reasonable, it does have downsides which the authors highlight. Their last analysis focuses on specific diseases (based on MeSH disease subcategories) and individual protein targets, comparing the rate of publications in the short term (2 year windows) versus the medium term (5 year windows). The idea is that this allows one to identify areas (or targets etc) that exhibit consistent, accelerating or decelerating growth (or interest). The resultant plots are quite interesting – though I wonder about the noise involved when going to something as specific as individual targets (identified by looking for gene symbols and synonyms).
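
The short term versus medium term comparison can be illustrated as follows. This sketch swaps in a simple annualized growth rate between two endpoint years and an arbitrary tolerance; the labels come from the text, but the thresholds, the function names and the counts are assumptions rather than the paper's procedure.

```python
def annual_growth(counts, end_year, span):
    """Average yearly growth rate between end_year - span and end_year.

    Assumes both years are present in counts with positive values.
    """
    start = counts[end_year - span]
    end = counts[end_year]
    return (end / start) ** (1.0 / span) - 1.0


def classify(counts, end_year, short=2, medium=5, tol=0.02):
    """Label a topic by comparing short term and medium term growth."""
    s = annual_growth(counts, end_year, short)
    m = annual_growth(counts, end_year, medium)
    if s > m + tol:
        return 'accelerating'
    if s < m - tol:
        return 'decelerating'
    return 'consistent'


# Made-up yearly publication counts for three topics (only the years used are listed)
steady = {2004: 100, 2007: 133.1, 2009: 161.051}   # about 10% per year throughout
speeding = {2004: 100, 2007: 110, 2009: 200}       # most growth in the last 2 years
slowing = {2004: 100, 2007: 190, 2009: 200}        # most growth happened early

for name, counts in [('steady', steady), ('speeding', speeding), ('slowing', slowing)]:
    print(name, classify(counts, 2009))
```

With small counts, as for individual targets, the two rates become noisy and the tolerance would need to be wider.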

While bibliometric analysis is a high level approach, the authors make a compelling case that the technique can identify and summarize the drug discovery research landscape. The authors do a great job in rationalizing many of their results. Of course, the analyses are not prospective (i.e., which area should we pour money into next?). I think one of the key features of the work is that it quantitatively characterizes research output and is able to link this to various other efforts (specific funding programs, policy etc) – with the caveat that there are many confounding factors and causal effects.

Written by Rajarshi Guha

October 31st, 2009 at 2:09 pm

Posted in Literature


Locality of References in a Paper


The other day I was reading a paper and as is my habit, while reading I flip to see what papers are being cited. Since this was an ACS journal, the references are listed in the order that they occur in the text. When the authors were discussing a point in the paper, they’d usually include a number of references. Given the ordering of the references, this implies that related references are grouped together in the bibliography.

This set me thinking – given a set of references and their citations within a paper, we can capture relationships between the references in various ways. Most obviously, one might analyze the cited papers (either in whole, or in part such as just the abstract or title) and draw conclusions.

However, the fact that the authors of the paper considered references X, Y and Z to be related to a specific point already provides us with some information. Thus, in a bibliography where references are ordered by first occurrence, can we use the “locality” of the references in the list to draw any conclusions? One could employ some form of a sliding window and look at groups of references. The key thing here would be to have a way to characterize a reference – so it’d probably require that you can access the title (or better, the abstract or full text) of the paper being cited. I will admit that I’m not sure what sort of conclusions one might draw from such an analysis – but it was interesting to observe “local behavior” in a list of references.
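
As a rough illustration of the sliding window idea, the sketch below characterizes each reference by the words in its title and scores every window by the mean pairwise Jaccard similarity of those word sets. The similarity measure, the stopword list and the titles are all made up for the example; abstracts or full text would presumably work better.

```python
import re

# Tiny, ad hoc stopword list
STOPWORDS = {'a', 'an', 'and', 'for', 'in', 'of', 'on', 'the', 'to', 'with'}


def title_words(title):
    """Lowercased set of words in a title, minus stopwords."""
    return {w for w in re.findall(r'[a-z]+', title.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def window_similarity(titles, size=3):
    """Mean pairwise title similarity for each window of `size` consecutive references.

    size must be at least 2; titles are assumed to be in order of first citation.
    """
    words = [title_words(t) for t in titles]
    pairs = [(i, j) for i in range(size) for j in range(i + 1, size)]
    scores = []
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        scores.append(sum(jaccard(window[i], window[j]) for i, j in pairs) / len(pairs))
    return scores


# Made-up reference titles, in the order they are first cited
titles = ["Docking of kinase inhibitors",
          "Scoring kinase inhibitors",
          "Random forest models",
          "Random forest regression models"]
scores = window_similarity(titles, size=2)
print(scores)
```

Dips in the score would suggest a boundary between two topics in the text, which is one possible conclusion to draw from locality.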

Not having followed work in bibliometrics, I’m sure someone has already thought of this and looked into it. If anybody has heard of stuff like this, I’d appreciate any pointers.

(Of course this is all moot, if we can’t easily access the paper itself)

Written by Rajarshi Guha

September 30th, 2008 at 7:40 pm