This paper by Prof. Tim Pederson in the Journal of Computational Linguistics highlights the need for authors of computational linguistics papers to release working software that can be used to reproduce results in their papers.
While the paper focuses on the field of computational linguistics (CL), the discussion applies equally well to other fields that publish computational research. Given my background in cheminformatics, which is heavily dependent on the use of computational tools, the points raised in the paper are very applicable. For example, Prof. Pederson states:
While we have table after table of results to pore over, we usually don’t have access to the software that would allow us to reproduce those results.
He also highlights four points on how to produce software to reproduce the results of research. In this post, I wanted to highlight some aspects that have bugged me in the past and I think are important for transparency in computational research.
When talking about reproducible cheminformatics research, a number of issues arise. Note that while many studies employ “packages” from well-known vendors to perform standard operations such as docking, I’m not considering them here (though commercial licenses and cost are certainly impediments to reproducing research conducted using such software). Rather, I’m focusing on research that aims to develop new algorithms that are not available in current software.
The single most important issue is to release the software. A link to a web page is all that’s required. Without a corresponding implementation, I believe that a paper describing a new algorithm should not get published.
Second is the issue of other people being able to run your code without having to fill out license forms. As an academic, my opinion is that research is for the public. Profit et al. comes a distant second. So I believe that academic software should not involve closed licenses. Nor, if possible, should such software depend on other components that have closed or commercial licenses. In other words, if I want to test somebody’s code, the software stack should avoid onerous licensing issues. Use something like the GPL, LGPL or BSD (or any other Open Source license).
OK, now that I have the code, is there a page I can read to learn how to run it? Or do I need to look at the sources? Can I run it directly? Or is compilation required? Granted, academic groups are not software companies. But I think a certain level of usability is required – even if it’s a CLI. Either make it compile easily (for, say, C or C++ code) or else have the build report clearly what went wrong. And provide documentation – yes, it’s boring and tedious. But it’s good software engineering practice. I shouldn’t have to go through an “undisclosed kabalistic sequence” (as stated by Prof. Pederson) to regenerate the results in the paper.
Ideally, software would use standardized input and output formats, defining a new one only when it’s really required. Don’t reinvent the wheel – use toolkits to do the grunt work (preferably Open Source toolkits such as the CDK or OpenBabel).
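To make the usability and format points concrete, here is a minimal sketch of what I’d consider a bare-minimum command line wrapper. Everything in it is an assumption made for illustration: the program name `score_molecules`, the `score` function (a placeholder standing in for whatever new algorithm a paper describes) and the choice of a plain SMILES file in and CSV out. In real code the molecule parsing should be handed to a toolkit such as the CDK or OpenBabel; the line splitting below only stands in for that so the sketch has no dependencies.

```python
import argparse
import csv
import sys


def score(smiles):
    """Hypothetical placeholder for the published method: just the string length."""
    return len(smiles)


def read_smiles(lines):
    """Yield (smiles, name) pairs from lines laid out as 'SMILES name'."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)
        smiles = parts[0]
        # Fall back to a generated name when the line carries no title
        name = parts[1] if len(parts) > 1 else f"mol{lineno}"
        yield smiles, name


def build_parser():
    """Describe the interface so that --help doubles as basic documentation."""
    parser = argparse.ArgumentParser(
        prog="score_molecules",
        description="Score each molecule in a SMILES file and write the results as CSV.",
    )
    parser.add_argument("input", help="SMILES file, one 'SMILES name' entry per line")
    parser.add_argument("-o", "--output", default="-",
                        help="CSV output file (default: standard output)")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    try:
        with open(args.input) as handle:
            rows = [(name, smi, score(smi)) for smi, name in read_smiles(handle)]
    except OSError as err:
        # Report a readable message instead of a raw traceback
        print(f"error: cannot read {args.input}: {err}", file=sys.stderr)
        return 1
    out = sys.stdout if args.output == "-" else open(args.output, "w", newline="")
    try:
        writer = csv.writer(out)
        writer.writerow(["name", "smiles", "score"])
        writer.writerows(rows)
    finally:
        if out is not sys.stdout:
            out.close()
    return 0
```

Calling `main(["molecules.smi", "-o", "scores.csv"])` runs the whole thing, and adding the usual `if __name__ == "__main__": sys.exit(main())` guard at the bottom turns it into a script that answers `--help`. None of this is sophisticated, which is rather the point: a help message, a readable error and a standard output format cost very little.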
What this means for researchers in fields like cheminformatics and computational chemistry is that software development must be considered at the same level as algorithmic research. This implies that software must be written with the thought that other people will use it. So use proper software engineering practice, write documentation, provide a decent user interface (command line is fine!) and be responsible for it. Yes, this means more effort and learning proper practices. But why isn’t that part of the duties of a researcher developing methods in a computational field? (After all, monkeys can be trained to push the buttons.) And frankly, students who come out of computational groups that develop algorithms without knowing what version control (for example) is should be ashamed of themselves.
I think this will require some change of mindset on the part of academics performing research in this field. Yes, it’s one more thing to do (in addition to paper writing, grants, etc.), but in the end, without usable software (not just code slapped together to generate a table of numbers) to reproduce your results, a publication is just advertising, not a contribution.
(Of course, I haven’t said anything about access to the data that was used to generate the results, but that’s a topic for another post.)