I came across an interesting paper by Ann Boulesteix where she discusses the problem of false positive results being reported in the bioinformatics literature. She highlights two underlying phenomena that lead to this issue – “fishing for significance” and “publication bias”.
The former phenomenon is characterized by researchers identifying datasets on which their method works better than others, or by a new method being (unconsciously) optimized for a given set of datasets. Then there is also the issue of validation of new methodologies, where she notes:
… fitting a prediction model and estimating its error rate using the same training data set yields a downwardly biased error estimate commonly termed as “apparent error”. Validation on independent fresh data is an important component of all prediction studies…
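The apparent-error effect in the quote is easy to reproduce. The following is a minimal sketch and is not taken from the paper: the 1-nearest-neighbour classifier, the pure-noise data, and all function names are assumptions chosen for illustration. Since the labels are random coin flips, no model can truly do better than about 50% error, yet the error measured on the training set is zero.

```python
import random

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-NN)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

def error_rate(train_X, train_y, eval_X, eval_y):
    """Fraction of evaluation points the 1-NN model misclassifies."""
    wrong = sum(
        nearest_neighbor_predict(train_X, train_y, x) != y
        for x, y in zip(eval_X, eval_y)
    )
    return wrong / len(eval_y)

def make_noise_data(n, n_features, rng):
    """Hypothetical data: Gaussian features, labels independent of them."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y

rng = random.Random(0)
train_X, train_y = make_noise_data(100, 5, rng)
test_X, test_y = make_noise_data(200, 5, rng)

# Apparent error: evaluate on the same data the model was fitted to.
apparent_error = error_rate(train_X, train_y, train_X, train_y)
# Hold-out error: evaluate on fresh data the model has never seen.
holdout_error = error_rate(train_X, train_y, test_X, test_y)

print(f"apparent error: {apparent_error:.2f}")
print(f"hold-out error: {holdout_error:.2f}")
```

Each training point is its own nearest neighbour, so the apparent error is exactly zero, while the hold-out error hovers around chance. Real methods are rarely this extreme, but the direction of the bias is the same.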
Boulesteix also points out that true, prospective validation is not always possible since the data may not be easily accessible or even available. She also notes that some of these problems could be mitigated by authors being very clear about the limitations of their methods and the assumptions they make about their datasets. As I have been reading the microarray literature recently to help me with RNAi screening data, I have seen the problem firsthand. There are hundreds of papers on normalization techniques and gene selection methods, and each one claims to be better than the others. But in most cases, the improvements seem incremental. Is the difference really significant? It’s not always clear.
I’ll note that this same problem is likely present in the cheminformatics literature. There are many papers which claim that their SVM (or some other algorithm) implementation does better than previous reports on modeling one thing or another. Is a 5% improvement really that good? Is it significant? Luckily there are recent efforts, such as SAMPL and the solubility challenge, to address these issues in various areas of cheminformatics. Also, there is a nice and very simple metric recently developed to compare different methods (focusing on rankings generated by virtual screening methods).
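One simple way to ask whether such an improvement is significant, when both methods are evaluated on the same test set, is an exact McNemar test on the cases where the two methods disagree. The sketch below is an illustration only: the counts are hypothetical (200 test compounds, with the new method ahead by a net 10 correct predictions, i.e. 5%), and `mcnemar_exact_p` is a helper name made up here, not a library function.

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: cases only the first method classifies correctly
    c: cases only the second method classifies correctly
    Under the null hypothesis of equal accuracy, each disagreement
    is equally likely to favour either method (a fair coin flip).
    """
    n = b + c
    if n == 0:
        return 1.0  # the methods never disagree
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, invented for illustration: out of 200 compounds,
# the two methods disagree on 40.
new_only_correct = 25  # new method right, old method wrong
old_only_correct = 15  # old method right, new method wrong

p_value = mcnemar_exact_p(new_only_correct, old_only_correct)
print(f"p-value: {p_value:.3f}")
```

With these made-up counts the p-value is above 0.05, so a 5% gain on a test set of this size would not be distinguishable from noise by this test. The same net gain on a much larger test set could well be.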
The issue of publication bias also plays a role in this problem – negative results are difficult to publish and hence a researcher will try to find a positive spin on results that may not even be significant. For example, a well-designed methodology paper will be difficult to publish if it cannot be shown to be better than other methods. One could get around such a rejection by cherry-picking datasets (even when the authors note that a dataset is cherry-picked, it limits the utility of the paper in my opinion), or by avoiding comparisons with certain other methods. So while a researcher may end up with a paper, it’s more CV padding than an actual improvement in the state of the art.
But as Boulesteix notes, “a negative aspect … may be counterbalanced by positive aspects”. Thus even though a method might not provide better accuracy than other methods, it might be better suited for specific situations or may provide a new insight into the underlying problem or even highlight open questions.
While the observations in this paper are not new, they are well articulated and highlight the dangers that can arise from a publish-or-perish and positive-results-only system.