So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for November, 2009

Are Bioinformatics Results Too Good To Be True?

with 2 comments

I came across an interesting paper by Ann Boulesteix where she discusses the problem of false positive results being reported in the bioinformatics literature. She highlights two underlying phenomena that lead to this issue – “fishing for significance” and “publication bias”.

The former phenomenon is characterized by researchers identifying datasets on which their method works better than others or where a new method is (unconciously) optimized for  given set of datasets.  Then there is also the issue of validation of new methodologies, where she notes

… fitting a prediction model and estimating its error rate using the same training data set yields a downwardly biased error estimate commonly termed as ”apparent error”. Validation on independent fresh data is an important component of all prediction studies…

Boulesteix also points out that true, prospective validation is not always possible since the data may not be easily accessible to even available. She also notes that some of these problems could be mitigated by authors being very clear about the limitations and dataset assumptions they make. As I have been reading the microarray literature recently to help me with RNAi screening data, I have seen the problem firsthand. There are hundreds of papers on normalization techniques and gene selection methods. And each one claims to be better than the others. But in most cases, the improvements seem incremental. Is the difference really significant? It’s not always clear.

I’ll also note that this same problem is also likely present in the cheminformatics literature. There are any papers which claim that their SVM (or some other algorithm) implementation does better than previous reports on modeling something or the other. Is a 5% improvement really that good? Is it significant? Luckily there are recent efforts, such as SAMPL and the solubility challenge to address these issues in various areas of cheminformatics. Also, there is a nice and very simple metric recently developed to compare different methods (focusing on rankings generated by virtual screening methods).

The issue of publication bias also plays a role in this problem – negative results are difficult to publish and hence a researcher will try and find a positive spin on results that may not even be significant. For example, a well designed methodology paper will be difficult to publish if it cannot be shown to be better than other methods. One could get around such a rejection by cherry picking datasets (even when noting that such a dataset is cherry picked, it limits the utility of the paper in my opinion), or by avoiding comparisons with certain other methods. So while a researcher may end up with a paper, it’s more CV padding than an actual improvement in the state of the art.

But as Boulesteix notes, “a negative aspect … may be counterbalanced by positive aspects“. Thus even though a method might not provide better accuracy than other methods, it might be better suited for specific situations or may provide a new insight into the underlying problem or even highlight open questions.

While the observations in this paper are not new, they are well articulated and highlight the dangers that can arise from a publish-or-perish and positive-results-only system.

Written by Rajarshi Guha

November 29th, 2009 at 3:27 pm

New Version of rpubchem

without comments

Version 1.4.3 of rpubchem is out on CRAN. There’s some minor code cleanups and also a new function called which allows you to get assay ID’s based on whether they contain a compound (either as an active, inactive, discrepant or just tested). This uses PUG to perform the query, so can be a bit slow (and occasionally just fail).

Written by Rajarshi Guha

November 21st, 2009 at 2:11 am

Posted in software

Tagged with , ,

Frequency of a Term via PubMed

with 6 comments

A little while back, Egon posted a question on FriendFeed, asking whether there was an easy way, preferably a service, to determine and plot the usage count of a term in PubMed by year. This is simple enough using the Entrez Utilities CGI. A quick Python script to do this (with minimal error checking) is given below. It’d be relatively trivial to wrap this as a mod_python application and generate a bar plot directly (either using Python or using one of the online charting API’s)

import urllib
import xml.etree.ElementTree as ET

u = ""
term = "artemisinin resistance"
startYear = 1998
endYear = 2009
for year in range(startYear, endYear+1):
    url = u % (term.replace(" ", "+"), year, year)
    page = urllib.urlopen(url).read()
    doc = ET.XML(page)
    count = doc.find("Count").text
    print year, count

Update 1

A little more hacking and the above code was converted to a mod_python application, which can be accessed using a URL of the form With the help of the handy pygooglechart module, the above URL returns an <img> tag containing the appropriate Google Charts URL. As a an example, the term “artemisinin resistance” results in this image.

Update 2

Jan Schoones pointed out in a comment that my artemisinin resistance example was slightly incorrect, as the resultant PubMed search does not search for the exact phrase, but rather, looks for documents that contain the words “artemisinin” and “resistance”. This is because the example URL does not include the quotes around the phrase. A more correct example would be here, where we search for the phrase, rather than individual words.

Written by Rajarshi Guha

November 10th, 2009 at 11:50 pm

Posted in software

Tagged with , ,

Updated Versions of R Packages

with 4 comments

New versions of several of my R packages are now available on CRAN. rcdk 2.9.6 goes along with rcdklibs 1.2.3. The latter now uses the most recent cdk-1.2.x branch from Github. The former fixes a number of bugs relating to descriptor calculations, saving molecules in SD format and setting/getting properties on molecules. Unfortunately, because the 1.2.x branch does not have robust depiction code, the visualization methods in rcdk are currently disabled. The fingerprint package has also been updated and now includes a number of unit tests.

Written by Rajarshi Guha

November 6th, 2009 at 1:04 am

Posted in cheminformatics,software

Tagged with ,

Another Conference Done

without comments

The CHI RNAi conference is over and will now head back home. Being new to the field of RNAi screening, I’ve been looking for a place (virtual or real) where I can meet other people, especially those working in large scale screening facilities. Reading the literature is certainly useful, but face to face interactions are always richer. I was very pleased to see the meeting was of a high quality. While it wasn’t always cutting edge (most of the work had been published, but is still new to me) there were some very interesting talks ranging from the use of RNAi screens to probe myeloma biology, mTOR addiction and reconstruction of genetic networks to meta-analysis of multiple RNAi screens for the identification of synthetic lethal targets, parallel chemical and RNAi screens and the use of complex phenotypes and their analysis. Of course, a lot of it went over my head – but that was to be expected :) I was also pleasantly surprised to see very few vendor talks – the bulk of the talks were from academics or staff of core facilities..I also got to meet a number of people involved in RNAi screening facilities and had some very enlightening discussions. A lot of things to implement and test when I get back home! Overall a very useful meeting and I hope to make it again next year.

Now, just need to get home and schedule the ACS CINF program for the Spring meeting.

Written by Rajarshi Guha

November 3rd, 2009 at 7:02 pm

Posted in bioinformatics,research

Tagged with ,