Jupyter Notebooks Meet the Challenge of Reproducibility
Every application needs to have one killer feature. For the Jupyter notebook, that feature may be reproducibility. Although reproducibility was initially an academic challenge, the entire IT sector could enjoy the benefits of reproducible computational assets, with the Jupyter notebook acting as the Docker container for the project being reviewed.
Especially for data-driven research, many argue there is no reason why the researcher can’t offer the full data set and associated logic to allow others to recreate the original work. Until recently, most research was simply summarized in a scientific paper, with the original calculations, and even the raw data set, lost to history. But when we are actually creating new knowledge with computers themselves, there is no reason not to also offer the computational pathway the researcher or project manager traveled to reach their conclusions.
This is where the Jupyter notebook, formerly called the IPython Notebook, comes in. It is an open-source web application for creating shareable documents with embedded live code, equations, visualizations and explanatory text.
And now the notebook format is beginning to find a home outside of academia, in regular business activities as well. Recently, O’Reilly held its first JupyterCon conference in New York. The company expected 400 people to attend, but 700 people showed up, many looking for ways to use Jupyter in decidedly non-academic environs.
“What Jupyter allows you to do is edit markdown cells. You can put text in there. You can put mathematical formulas in there. But you can also put your code in there. That code also has a run button, and the output you get is also put into the notebook,” explained O’Reilly Chief Technology Officer Andrew Odewahn, who was a co-chair of the conference. The Jupyter name is an amalgamation of Julia, Python and R, the three languages it natively supports (although there are pluggable kernels for 90 additional languages).
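To make Odewahn's description concrete, here is a minimal sketch of how a notebook file is laid out on disk: a JSON document holding a list of cells, where markdown cells carry text and formulas and code cells carry both the source and the output it produced. The cell contents and the `summarize` helper are invented for illustration, and the layout assumes version 4 of the notebook format.

```python
import json

# A hand-built notebook in the version 4 JSON layout.
# The cell contents below are hypothetical, chosen only for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Narrative text and formulas live in markdown cells.
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Average shelf count\nMean: $\\bar{x} = \\frac{1}{n}\\sum_i x_i$",
        },
        {
            # Code cells keep the source and the output it produced together.
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": "counts = [12, 9, 15]\nprint(sum(counts) / len(counts))",
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "12.0\n"}
            ],
        },
    ],
}


def summarize(nb):
    """Return (cell type, first source line) for each cell (hypothetical helper)."""
    return [(c["cell_type"], c["source"].splitlines()[0]) for c in nb["cells"]]


# The whole document survives a round trip through plain JSON text,
# which is what makes it easy to share and to keep under version control.
restored = json.loads(json.dumps(notebook))
print(summarize(restored))
```

Because text, code and results sit side by side in one file, a reader receives the narrative and the computation together rather than a summary of it.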
— The New Stack (@thenewstack) August 25, 2017
Reproducibility in Business
One group working heavily with Jupyter has been Microsoft.
“Jupyter notebooks are key to the way we prototype with our customers. It allows us to work across geographies, organizations, teams, and even across time, where we can document our thought processes for others,” said Microsoft’s Patty Ryan, during one session at the conference. In true Jupyter form, the company’s use cases are also documented.
In one case, the notebooks helped Microsoft with a large European confectioner that wanted to keep a closer watch on how retail outlets were stocking its items, specifically whether they were shelving goods according to the policies set by the company. Heretofore it had largely been a manual effort, sending out people to individual stores to document how the shelves were arranged. The company wanted an app that could automatically assess, via an image and some deep learning (through either the handheld device itself or back in the cloud), how the shelves were stocked. It would be a complex app to build, even with the assistance of cognitive computing tools Microsoft offered through Azure.
“We find Jupyter notebooks really handy in these situations,” said Microsoft’s Michael Lanzetta. Microsoft held an informal one-week hackathon with the company’s developers to build some prototypes, which could be shared via the notebooks. The notebooks “allowed for quick iteration,” Lanzetta said. All the options were documented, and in a way that was far more thorough than source code alone. The most successful results were published for the company’s executives to preview.
Another heavy user of Jupyter has been O’Reilly Media, which is using Jupyter as a component in rethinking how it presents technical content through its Safari platform.
“Jupyter notebooks become documents that allow you to narrate your computation with text and all sorts of annotations,” Odewahn said. “It allows you to do beautiful documentation on how your computation works.”
Interactivity vs. Reproducibility
If you haven’t caught it yet, the scientific community has been experiencing a bit of a crisis lately around reproducibility. Across a number of fields, scientific studies that were thought to be unimpeachable were, in fact, difficult or impossible to replicate, casting doubt not only on the specific findings but on the modern scientific process itself.
Jupyter is the perfect tool for aiding in reproducibility, asserted Lorena Barba, a professor of mechanical engineering at George Washington University, speaking at the event. This is not only because a notebook can be used by anyone to reproduce the final calculations, but also because the user can interact with the data as well. A notebook can help critics pinpoint errors in interpretation, and speed replication with the insertion of new data sets.
The interactivity part is quite important, and a somewhat new addition to the discussion of reproducibility, Barba noted.
In her talk, Barba discussed Stanford University professor Jon Claerbout, who is widely seen as the father of reproducible computational research. In the early 1990s, he required his students to submit thesis research in a form where it could be recreated through a single click.
Claerbout’s idea was to “leave a finished work in a state where a coworker can create the complete calculation and workflow with a single command,” Claerbout explained. Using Jupyter for this task is certainly superior to, say, using Excel, which could indeed present the numbers and calculations to observers, but offers no context about what the researcher did to arrive at those results.
Relying on Excel for important calculations is like driving drunk: no matter how carefully you do it, a wreck is likely. #reproducibility
— Philip Stark (@philipbstark) August 11, 2014
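The "single command" idea can be illustrated with a toy sketch: one function call that re-runs every code cell from top to bottom in a fresh namespace and collects what each cell prints. This is not how Jupyter itself executes notebooks (it hands cells to a language kernel, and nbconvert offers an `--execute` option for batch runs); the `run_all` helper and the cell contents here are hypothetical and exist only to show the principle.

```python
import contextlib
import io


def run_all(cells):
    """Execute code cells in order in one fresh, shared namespace.

    Returns the text each code cell printed. Hypothetical helper, a
    stand-in for the kernel that a real notebook would use.
    """
    namespace = {}
    outputs = []
    for cell in cells:
        if cell["cell_type"] != "code":
            continue  # markdown cells document the work but do not run
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(cell["source"], namespace)
        outputs.append(buffer.getvalue())
    return outputs


# Invented example cells: a note, a data-loading step and a calculation.
cells = [
    {"cell_type": "markdown", "source": "Load the counts, then average them."},
    {"cell_type": "code", "source": "counts = [12, 9, 15]"},
    {"cell_type": "code", "source": "print(sum(counts) / len(counts))"},
]

# One call recreates the whole calculation; a second call should agree.
first_run = run_all(cells)
second_run = run_all(cells)
print(first_run == second_run)
```

Starting each run from an empty namespace matters: a result that depends on leftover state from an earlier session is exactly the kind that a coworker cannot recreate.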
Claerbout, however, was not a big fan of users meddling with the results, fearing that any sort of interactivity within the program would drive users off course. Barba sees the interactivity as an essential component of the scientific approach.
“Reproducibility is really about trust. It can’t work with one click,” she said.
Microsoft is a sponsor of The New Stack.