So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘bioinformatics’ Category

RNAi in PubChem

without comments

While considering ways to disseminate RNAi screening data, I found out that PubChem now contains two RNAi screening datasets – AIDs 1622 and 1904. These screens reuse the PubChem bioassay formats – which is both good and bad. For example, while there are a few standardized columns (such as PUBCHEM_ACTIVITY_SCORE), the bulk of the user-deposited columns are not formally defined. In other words, you’d have to read the assay description. While not a huge deal, it would be nice if we could use pre-existing formats such as MIARE, analogous to MIAME for microarray data. That way we could determine the number of replicates, the normalization method employed and other details of the screen. As far as I can tell, not all aspects of an RNAi screen are fully defined in the MIARE vocabulary yet, and there don’t seem to be a whole lot of examples. But it’s a start.

But of course, nothing is perfect. Why, oh why, would a tab delimited format be contained within multiple worksheets of an Excel workbook!

Written by Rajarshi Guha

April 19th, 2010 at 12:19 am

Posted in bioinformatics

ChEMBL in RDF and Other Musings

with one comment

Earlier today, Egon announced the release of an RDF version of ChEMBL, hosted at Uppsala. A nice feature of this setup is that one can play around with the data via SPARQL queries as well as explore the classes and properties that the Uppsala folks have implemented. Having fiddled with SPARQL on and off, it was nice to play with ChEMBL since it contains such a wide array of data types. For example, to find articles referring to an assay (or experiment) run in mice targeting isomerases:

PREFIX  chembl:  <>
SELECT *
WHERE {
  ?protein chembl:hasKeyword "Isomerase" .
  ?x chembl:hasTarget ?protein .
  ?protein chembl:hasDescription ?pdesc .
  ?x chembl:organism  "Mus musculus" .
  ?x chembl:hasDescription ?DESC .
  ?x chembl:extractedFrom ?resource .
  ?resource <> ?pmid
}

I’ve been following the discussion on RDF and the Semantic Web for some time. While I can see a number of benefits from this approach, I’ve never been fully convinced of its utility. In too many cases, the use cases I’ve seen (such as the one above) could have been handled relatively trivially via traditional SQL queries. There hasn’t been a really novel use case that leads to ‘Aha! So that’s what it’s good for’.

Egon’s announcement today led to a discussion on FriendFeed. I think I finally got the point that SPARQL queries are not magic and could indeed be replaced by traditional SQL. The primary value in RDF is the presence of linked data – which is slowly accumulating in the life sciences (cf. LODD and Bio2RDF).

Of the various features of RDF that I’ve heard about, the ability to define and use equivalence relationships seems very useful. I can see this being used to jump from domain to domain by recognizing properties that are equivalent across domains. Yet, as far as I can tell, this requires that somebody defines these equivalences manually. If we have to do that, one could argue that it’s not really different from defining a mapping table to link two RDBMS’s.

But I suppose in the end what I’d like to see is using all this RDF data to perform automated or semi-automated inferencing. In other words, what non-obvious relationships can be drawn from a collection of facts and relationships? In the absence of that, I am not necessarily pulling out a novel relationship (though I may be pulling out facts that I did not necessarily know) by constructing a SPARQL query. Is such inferencing even possible?

Along those lines, I considered an interesting set of linked data – could we generate a geographically annotated version of PubMed? Essentially, identify a city and country for each PubMed ID. This could be converted to RDF and linked to other sources. One could start asking questions such as are people around me working on a certain topic? or what proteins are the focus of research in region X? Clearly, such a dataset does not require RDF per se. But given that geolocation data is qualitatively different from, say, UniProt ID’s and PubMed ID’s, it’d be interesting to see whether anything came of this. As a first step, here’s BioPython code to retrieve the Affiliation field from PubMed entries from 2009 and 2010.

# Python 2 syntax (print statements)
from Bio import Entrez

startYear = 2009
endYear = 2010
Entrez.email = ""  # NCBI asks for a contact address

# Find every PubMed ID with a publication date in the range
h = Entrez.esearch(db='pubmed', term='%d:%d[dp]' % (startYear,endYear), retmax=1000000)
records = Entrez.read(h)['IdList']
print 'Got %d records' % (len(records))
o = open('geo.txt', 'w')
for pmid in records:
    print 'Processing PMID %s' % (pmid)
    hf = Entrez.efetch(db='pubmed', id=pmid, retmode='xml', rettype='full')
    details = Entrez.read(hf)[0]
    try:
        aff = details['MedlineCitation']['Article']['Affiliation']
    except KeyError:
        print '%s had no affiliation' % (pmid)
        continue
    try:
        # One tab-separated line per PMID: ID, then affiliation
        o.write('%s\t%s\n' % (pmid, aff.encode('latin-1')))
    except UnicodeEncodeError:
        print 'Cant encode for %s' % (pmid)
o.close()

Using data from the National Geospatial Agency, it shouldn’t be too difficult to link PubMed ID’s to geography.
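To give a feel for that linking step, here is a minimal sketch of pulling a country out of an affiliation string. The tiny GAZETTEER dictionary and the guess_country function are hypothetical stand-ins of my own; a real version would load place names from a proper gazetteer and would need to handle cities, abbreviations and ambiguous names.

```python
# Hypothetical mini-gazetteer; a real one would be loaded from a place-name database.
GAZETTEER = {
    "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "sweden": "Sweden",
    "india": "India",
}


def guess_country(affiliation):
    """Return a country name for an affiliation string, or None if nothing matches."""
    # Affiliations usually run from specific to general, so scan from the end.
    parts = [p.strip(" .").lower() for p in affiliation.split(",")]
    for part in reversed(parts):
        # Strip punctuation from each word so "sweden." still matches.
        tokens = [t.strip(".;") for t in part.split()]
        # Try progressively shorter prefixes to skip trailing e-mails or postcodes.
        for n in range(len(tokens), 0, -1):
            candidate = " ".join(tokens[:n])
            if candidate in GAZETTEER:
                return GAZETTEER[candidate]
    return None
```

Each line of geo.txt could be split on the tab and the second field passed to guess_country, keeping the PMIDs that return None for manual inspection.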

Written by Rajarshi Guha

February 10th, 2010 at 4:40 am

When is a Bad Plate Bad?

without comments

When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be very good. That is, the data will be of poor quality for a variety of reasons. Usually we can evaluate various statistical quality metrics to assess which plates are good and which ones need to be redone. A common metric is the Z-factor, which uses the positive and negative control wells. The problem is that if one or two wells have a problem (say, no signal in the negative control) then the Z-factor will be very poor. Yet, the plate could be used if we just mask those bad wells.

Now, for our current screens (100 plates) manual inspection is boring but doable. As we move to genome-wide screens we need a better way to distinguish truly bad plates from plates that could still be used. One approach is to move to other metrics – SSMD (defined here and applications to quality control discussed here) is regarded as more effective than Z-factor – and in fact it’s advisable to look at multiple metrics rather than depend on any single one.
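For reference, a minimal sketch of SSMD computed from the control wells of a single plate. It uses the simple moment-based estimate (difference of the control means divided by the square root of the summed variances) and assumes the two control groups are independent; the function name ssmd and the well values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def ssmd(pos, neg):
    """Strictly standardized mean difference between two sets of control wells.

    Assumes the positive and negative controls are independent, so the
    variance of their difference is the sum of the two sample variances.
    """
    return (mean(pos) - mean(neg)) / sqrt(stdev(pos) ** 2 + stdev(neg) ** 2)


# Made-up control readings for one plate
positive_wells = [10, 12, 14]
negative_wells = [1, 2, 3]
plate_ssmd = ssmd(positive_wells, negative_wells)
```

The sign of the result simply reflects which control group has the larger mean.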

An alternative trick is to compare the Z-factor for a given plate to the trimmed Z-factor, which is evaluated using the trimmed means and standard deviations. In our setup we trim 10% of the positive and negative control wells. For a plate that appears to be poor due to one or two bad control wells, the trimmed Z-factor should be significantly higher than the original Z-factor. But for a plate in which, say, the negative control wells all show poor signal, there should not be much of a difference between the two values. The analysis can be rapidly performed using a plot of the two values, as shown below. Given such a plot, we’d probably consider redoing only plates whose trimmed Z-factor is less than 0.5 and which lie close to the diagonal. (Though for RNAi screens, Z’ = 0.5 might be too stringent.)
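A minimal sketch of the comparison, using the usual definition Z’ = 1 - 3(σp + σn)/|μp - μn|. The trimming rule here (drop the lowest and highest 10% of each control group before computing the mean and standard deviation) is an assumption about what "trim 10%" means, and the well values are invented.

```python
from statistics import mean, stdev


def trim(values, frac=0.1):
    """Drop the lowest and highest `frac` of the values (assumed trimming rule)."""
    vals = sorted(values)
    k = int(len(vals) * frac)
    return vals[k:len(vals) - k] if k else vals


def z_factor(pos, neg):
    """Z-factor from positive and negative control wells."""
    return 1 - 3 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))


def trimmed_z_factor(pos, neg, frac=0.1):
    """Z-factor computed after trimming both control groups."""
    return z_factor(trim(pos, frac), trim(neg, frac))


pos = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]

# Plate A: one bad negative control well (the 60)
neg_one_bad = [10, 11, 9, 10, 12, 8, 10, 11, 9, 60]
# Plate B: negative controls are noisy across the board
neg_all_bad = [10, 40, 5, 55, 20, 60, 15, 45, 30, 50]

plate_a = (z_factor(pos, neg_one_bad), trimmed_z_factor(pos, neg_one_bad))
plate_b = (z_factor(pos, neg_all_bad), trimmed_z_factor(pos, neg_all_bad))
```

With these numbers plate A moves from below 0.5 to well above it once the outlier is trimmed, whereas plate B stays below 0.5 either way, which is the pattern one would look for in the plot.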

From the figure below, just looking at the Z-factor would have suggested 4 or 5 plates to redo. But when compared to the trimmed Z-factor, this comes down to a single plate. Of course, we’d look at other statistics as well, but it is a quick way to identify plates with truly poor Z-factors.

A plot of Z-factor versus trimmed Z-factor for a set of 100 plates

Written by Rajarshi Guha

January 29th, 2010 at 5:47 pm

Are Bioinformatics Results Too Good To Be True?

with 2 comments

I came across an interesting paper by Ann Boulesteix where she discusses the problem of false positive results being reported in the bioinformatics literature. She highlights two underlying phenomena that lead to this issue – “fishing for significance” and “publication bias”.

The former phenomenon is characterized by researchers identifying datasets on which their method works better than others, or by a new method being (unconsciously) optimized for a given set of datasets. Then there is also the issue of validation of new methodologies, where she notes

… fitting a prediction model and estimating its error rate using the same training data set yields a downwardly biased error estimate commonly termed as “apparent error”. Validation on independent fresh data is an important component of all prediction studies…
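The “apparent error” effect in the quote is easy to reproduce with a toy example. The sketch below is my own illustration, not something from the paper: a 1-nearest-neighbour classifier is fit to data whose labels are pure coin flips, so there is nothing to learn, yet the error measured on the training data looks perfect.

```python
import random


def nearest_label(x, train):
    """Label of the training point closest to x (1-nearest-neighbour)."""
    return min(train, key=lambda item: (item[0] - x) ** 2)[1]


def error_rate(data, train):
    """Fraction of points in `data` misclassified by 1-NN built on `train`."""
    wrong = sum(1 for x, y in data if nearest_label(x, train) != y)
    return wrong / len(data)


def make_data(rng, n):
    """Random feature in [0, 1) with a coin-flip label unrelated to the feature."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


rng = random.Random(0)
train = make_data(rng, 200)
fresh = make_data(rng, 200)

apparent = error_rate(train, train)     # error on the data used for fitting
independent = error_rate(fresh, train)  # error on independent fresh data
```

Every training point is its own nearest neighbour, so the apparent error is zero, while the error on the fresh data hovers around the 50% one expects from guessing.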

Boulesteix also points out that true, prospective validation is not always possible since the data may not be easily accessible or even available. She also notes that some of these problems could be mitigated by authors being very clear about the limitations and dataset assumptions they make. As I have been reading the microarray literature recently to help me with RNAi screening data, I have seen the problem firsthand. There are hundreds of papers on normalization techniques and gene selection methods. And each one claims to be better than the others. But in most cases, the improvements seem incremental. Is the difference really significant? It’s not always clear.

I’ll also note that the same problem is likely present in the cheminformatics literature. There are many papers which claim that their SVM (or some other algorithm) implementation does better than previous reports on modeling something or the other. Is a 5% improvement really that good? Is it significant? Luckily there are recent efforts, such as SAMPL and the solubility challenge, to address these issues in various areas of cheminformatics. Also, there is a nice and very simple metric recently developed to compare different methods (focusing on rankings generated by virtual screening methods).

The issue of publication bias also plays a role in this problem – negative results are difficult to publish and hence a researcher will try and find a positive spin on results that may not even be significant. For example, a well designed methodology paper will be difficult to publish if it cannot be shown to be better than other methods. One could get around such a rejection by cherry picking datasets (even when noting that such a dataset is cherry picked, it limits the utility of the paper in my opinion), or by avoiding comparisons with certain other methods. So while a researcher may end up with a paper, it’s more CV padding than an actual improvement in the state of the art.

But as Boulesteix notes, “a negative aspect … may be counterbalanced by positive aspects”. Thus even though a method might not provide better accuracy than other methods, it might be better suited for specific situations or may provide a new insight into the underlying problem or even highlight open questions.

While the observations in this paper are not new, they are well articulated and highlight the dangers that can arise from a publish-or-perish and positive-results-only system.

Written by Rajarshi Guha

November 29th, 2009 at 3:27 pm

Another Conference Done

without comments

The CHI RNAi conference is over and I will now head back home. Being new to the field of RNAi screening, I’ve been looking for a place (virtual or real) where I can meet other people, especially those working in large scale screening facilities. Reading the literature is certainly useful, but face to face interactions are always richer. I was very pleased to see the meeting was of a high quality. While it wasn’t always cutting edge (most of the work had been published, but is still new to me) there were some very interesting talks ranging from the use of RNAi screens to probe myeloma biology, mTOR addiction and reconstruction of genetic networks to meta-analysis of multiple RNAi screens for the identification of synthetic lethal targets, parallel chemical and RNAi screens and the use of complex phenotypes and their analysis. Of course, a lot of it went over my head – but that was to be expected :) I was also pleasantly surprised to see very few vendor talks – the bulk of the talks were from academics or staff of core facilities. I also got to meet a number of people involved in RNAi screening facilities and had some very enlightening discussions. A lot of things to implement and test when I get back home! Overall a very useful meeting and I hope to make it again next year.

Now I just need to get home and schedule the ACS CINF program for the Spring meeting.

Written by Rajarshi Guha

November 3rd, 2009 at 7:02 pm

Posted in bioinformatics,research
