Reproducible Research, a manifesto-like paper by a number of authors from different scientific disciplines, has been published in Computing in Science and Engineering.
Progress in computational science is often hampered by researchers’ inability to independently reproduce or verify published results. Attendees at a roundtable at Yale Law School formulated a set of steps that scientists, funding agencies, and journals might take to improve the situation. We describe those steps here, along with a proposal for best practices using currently available options and some long-term goals for the development of new tools and standards.
Seismic Unix (SU) is a famous open-source seismic processing package maintained by John Stockwell at the Center for Wave Phenomena, Colorado School of Mines.
SU has been around for 25 years and has attracted many devoted users. If you are one of them, please consider the following:
- Using Seismic Unix is not an excuse for non-reproducible computational experiments. To facilitate reproducibility, you can use Python and SCons with the rsf.suproj module supplied by Madagascar. The book/rsf/su directory contains many examples of seismic data processing flows using SU and their loose translation to Madagascar analogs. Here is an example SConstruct script from rsf/su/sulab1:
from rsf.suproj import *

Flow('plane',None,'suplane')
Result('plane','suxwigb label1="Time (s)" label2=Trace')

Flow('specfx','plane','suspecfx')
Result('specfx','suxwigb label1="Frequency (Hz)" label2=Trace')

End()
Its loose Madagascar translation is in rsf/su/rsflab1:
from rsf.proj import *

Flow('plane',None,
     '''
     spike n1=64 n2=32 d2=1 o2=1 label2=Trace unit2=
     nsp=3 k1=8,20,32 k2=4 l2=28 p2=2,1,0
     ''')

Flow('specfx','plane','spectra | scale axis=2')

for plot in ('plane','specfx'):
    Result(plot,
           '''
           wiggle clip=1 transp=y yreverse=y poly=y
           wanttitle=n wheretitle=b wherexlabel=t
           ''')

End()
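To make the spike parameters in the Madagascar flow above more concrete, here is a minimal pure-Python sketch of the kind of model they appear to describe: three dipping events on a 64-sample by 32-trace grid. The reading of k1, k2, l2, and p2 used here (start sample, first trace, last trace, dip in samples per trace) is an assumption, and the function name plane_model is hypothetical; this is an illustration, not a reimplementation of sfspike.

```python
# Hypothetical stand-in for the 'plane' model. The interpretation of the
# spike parameters below is an assumption, not taken from the sfspike source.
N1, N2 = 64, 32        # samples per trace, number of traces
K1 = (8, 20, 32)       # sample index of each event on its first trace (1-based)
P2 = (2, 1, 0)         # dip of each event, in samples per trace
K2, L2 = 4, 28         # first and last trace carrying the events (1-based)


def plane_model():
    """Return a list of N2 traces, each a list of N1 samples (0.0 or 1.0)."""
    traces = [[0.0] * N1 for _ in range(N2)]
    for k1, p2 in zip(K1, P2):
        for trace in range(K2, L2 + 1):
            # Each event moves down by p2 samples per trace.
            sample = k1 + p2 * (trace - K2)
            traces[trace - 1][sample - 1] = 1.0
    return traces


model = plane_model()
# Print the 1-based sample positions of the events on a few traces.
for trace in (4, 16, 28):
    hits = [i + 1 for i, v in enumerate(model[trace - 1]) if v != 0.0]
    print("trace", trace, "-> samples", hits)
```

Under these assumptions, the three events cross at a single point (trace 16, sample 32), which is consistent with the crossing-planes picture that the wiggle plot is meant to display.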
If you want only rsf.suproj but not the rest of Madagascar, download the madagascar-framework package from SourceForge.
It is also possible to convert between SU and RSF file formats with sfsuread and sfsuwrite and to combine SU and Madagascar programs in common processing flows.
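As a rough illustration of what such a format conversion involves, the sketch below reads SU traces from a byte string and builds the text of an RSF-style header. It is not how sfsuread is implemented, and the function names (make_su_trace, read_su, rsf_header) are hypothetical. It assumes the usual SU layout: each trace has a 240-byte header, with the number of samples and the sample interval in microseconds stored as unsigned 16-bit integers at byte offsets 114 and 116, followed by native 4-byte floats. It also assumes that all traces have the same length and ignores the remaining header words, which a real converter would need to keep (for example, in a separate trace-header file).

```python
import struct

SU_HEADER_BYTES = 240  # SEG-Y style trace header kept by SU
NS_OFFSET = 114        # number of samples per trace (unsigned 16-bit)
DT_OFFSET = 116        # sample interval in microseconds (unsigned 16-bit)


def make_su_trace(samples, dt_us=4000):
    """Build one SU trace (header + samples) for demonstration purposes."""
    header = bytearray(SU_HEADER_BYTES)
    struct.pack_into("=H", header, NS_OFFSET, len(samples))
    struct.pack_into("=H", header, DT_OFFSET, dt_us)
    return bytes(header) + struct.pack("=%df" % len(samples), *samples)


def read_su(buf):
    """Return (traces, dt_us) parsed from a byte string of SU traces."""
    traces, dt_us, pos = [], None, 0
    while pos < len(buf):
        (ns,) = struct.unpack_from("=H", buf, pos + NS_OFFSET)
        (dt_us,) = struct.unpack_from("=H", buf, pos + DT_OFFSET)
        pos += SU_HEADER_BYTES
        traces.append(list(struct.unpack_from("=%df" % ns, buf, pos)))
        pos += 4 * ns
    return traces, dt_us


def rsf_header(traces, dt_us, data_path):
    """Build RSF-style header text describing samples stored at data_path."""
    fields = [
        "n1=%d" % len(traces[0]),       # samples per trace
        "n2=%d" % len(traces),          # number of traces
        "d1=%g" % (dt_us * 1e-6),       # sample interval in seconds
        "o1=0",
        "esize=4",
        'data_format="native_float"',
        'in="%s"' % data_path,          # location of the binary samples
    ]
    return "\n".join(fields) + "\n"


# Usage: two short synthetic traces, converted to an RSF-style description.
su_bytes = make_su_trace([1.0, 2.0, 3.0]) + make_su_trace([4.0, 5.0, 6.0])
traces, dt_us = read_su(su_bytes)
print(rsf_header(traces, dt_us, "demo.rsf@"))
```

The split between a small text header and a separate binary file is what lets RSF data be inspected and edited with ordinary text tools, whereas SU keeps the metadata inside each trace.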
If you decide to switch to Madagascar but are missing certain functionality from Seismic Unix, it is possible and legal to borrow code from SU and add it to Madagascar. The opposite is not true, because the Madagascar license (GPL) is more restrictive than the SU license (BSD-like). The su directory contains some of the codes Madagascar has borrowed from SU at users' request. Naturally, we try to limit such borrowing to avoid unnecessary forks.
In a recent message to the Seismic Unix mailing list, John Stockwell described a proposal for S3, the third generation of SU. The main requirements for S3 are:
- a new project managed on SourceForge
- GPL license
- flexible trace headers
- integration with scientific libraries
- integration with the current SU as well as other GPL- or BSD-licensed packages
One cannot help thinking that the project that John describes is Madagascar!
Eureka Daily, a science blog at The Times newspaper, published a two-part article by Hannah Devlin about freedom of information in science (“FOI: should scientists be exempt?” and “Freedom of information and climate science” – both require a subscription). Part two discusses issues of openness in the context of a recent investigation of research practices of CRU – a climate research group at The University of East Anglia.
Here is an interesting excerpt from the second part:
As Myles Allen, a climate scientist at the University of Oxford points out, in most cases that which is in the public interest will be good for science too. Validation and replication are central to the scientific method. However, points of contention remain about the optimum degree of information sharing. Allen, for instance, suggests that while open access to data is generally desirable, making the computer code used to analyse data available online could have unintended negative consequences. If everyone’s using the same code, who’s going to challenge whether it’s working correctly?
This view is countered by programmer John Graham-Cumming, who found coding errors after trying to reproduce the CRU/Met Office’s CRUTEM and HadCRUT global warming datasets. Working from the raw data released by the Met Office and the description of their process for generating the datasets in a scientific paper he decided to validate their work – a considerable effort that required writing code to implement the algorithm described in the paper. In doing so, he found a problem with the way the error ranges were calculated (amongst other errors), stemming from a bug in their code.
He says: “You could say that by not releasing their buggy code they forced me to find the bug in it by writing my own validation. But actually, if they’d released their code I would have been able to quickly compare the code and the paper and find the bug without the massive effort to write new code. And no one else had actually done this validation (including the Muir Russell review) and as a result the Met Office has been releasing incorrect data for a long time. Perhaps that’s because the validation was so hard in the first place, whereas having code to check would have been easy.”
John Graham-Cumming demonstrated why reproducibility is crucial for computational sciences: it exposes scientific algorithms and workflows to a greater audience, thus preventing critical bugs from going unnoticed.
Reproducibility is an approach to openness in computational sciences. It assumes that not only data but also source code (and everything else needed to reproduce published results) should be released. At the end of the day, it might save one’s scientific credibility from a rather unpleasant public exposé.
The July 2010 School and Workshop in Houston went well and attracted more than 50 people, about half of them graduate students. About 10 companies and 12 universities were represented. The event was sponsored by the Petroleum Technology Transfer Council. All presentation materials from the workshop are now available on the website.