Open Source Builders: Why Data Scientists Love Matplotlib

26 May 2020 1:00pm, by Matt Asay

This is part of a series on Open Source Builders. For a list of other articles in this series, check out the introductory post.

Amazon Web Services (AWS) sponsored this post.

Matt Asay
Matt is a principal at AWS and has been involved in open source and all that it enables (cloud, machine learning, data infrastructure, mobile, etc.) for nearly two decades, working for a variety of open source companies and writing regularly for InfoWorld and TechRepublic. You can follow him on Twitter (@mjasay).

No matter how many thousands of large data sets you may be crunching with TensorFlow, or how much you use PyTorch to accelerate tensor computation with GPUs, at some point you’ll want to represent your results with cross-platform charts and figures. And for that, you’re almost certainly going to want to get to know Matplotlib, an essential Python 2D plotting library for data visualization. Though Matplotlib is beloved by data scientists, its roots are in physical science, oceanography, and climatology. Data science folks came later, borrowed the core libraries, and have applied them to more corporate uses.

Though today there is an ever-growing universe of Python-based data science tools and libraries, for years Matplotlib was the only way to make plots in Python, and it remains the default. At the heart of the Matplotlib development community is project lead Thomas Caswell, who found his way to leadership almost by accident as he went from answering Matplotlib questions on Stack Overflow to submitting bug fixes to authoring patches.

In a recent web conference, Caswell walked me through his journey to Matplotlib leadership and why he contributes.

A Contribution Evolution

Caswell wasn’t the founder of Matplotlib — that honor goes to John Hunter, an epilepsy researcher at the University of Chicago Medical Center in the early 2000s. Hunter grew tired of fighting for access to the hardware key dongle that allowed him to use a proprietary software program for doing electrocorticography analysis. Hunter first tried to replace this program with MATLAB but found it unsuitable for his needs, so he set out to build what became Matplotlib.

During this time, Caswell had his own struggles with MATLAB related to memory management and was looking for options to further his academic work at the University of Chicago. As he dove into Python, he naturally ran into Matplotlib and, as mentioned, first contributed insight and eventually code. That code was made all the better under the tutelage of Mike Droettboom, who assumed project leadership after Hunter’s unfortunate passing in 2012. As Caswell remembers, “Droettboom taught me almost everything I know about programming.” Caswell worked closely with Droettboom and, over time, became Matplotlib’s lead maintainer.

How Caswell’s Matplotlib contributions evolved is worth noting, because this evolution is a useful guide for others who may want to start contributing to an open source project.

Caswell notes that answering Stack Overflow questions turns out to be an exceptional way to learn a library, because it puts you into a position to encounter others’ use cases. It was also an ideal way to start “fixing” bugs in the code without touching the code. Caswell says that eventually he was given commit rights so that he could apply pressure on the bug backlog in the other direction.

At the same time, Caswell’s experience surfaces another facet of community-driven open source projects: you can’t force it. Caswell says that over the past several years, Matplotlib’s development has been entirely volunteer-driven — by a combination of people from industry who do it either on their discretionary time at work or on nights and weekends, and a collection of professors and students. He says this makes for an interesting management problem, because you can’t tell anyone to do anything. There is “no coercion” in the community — just persuasion.

Add to this the interesting conflicts that arise when you have primarily text communication between people from different cultural backgrounds, he says, and managing an open source community ends up offering MBA-level experience to people who likely have zero interest in an MBA.

Familiar but Different

Over the years, one of the guiding principles of Matplotlib has been to retain some connection to MATLAB while also innovating. That tie back to MATLAB has been important, because so much of the potential user community has historically started with MATLAB while in science and engineering classes at universities.

Herein lies one of the great strengths of Matplotlib, as well as a fundamental tension: how to balance familiarity with innovation.
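To make that tension concrete, here is a minimal sketch (not taken from the interview) of the same chart drawn two ways: first with the MATLAB-flavored `pyplot` calls that act on an implicit "current figure," then with Matplotlib's explicit figure-and-axes objects. The data and labels are made up for illustration, and the `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run, so use a non-interactive backend
import matplotlib.pyplot as plt

# Made-up sample data for illustration
xs = [0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# MATLAB-style: each call acts on an implicit "current" figure and axes
plt.figure()
plt.plot(xs, ys)
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("pyplot interface")
implicit_ax = plt.gca()  # grab the current axes so we can inspect it later

# Object-oriented style: the figure and axes are explicit objects
fig, ax = plt.subplots()
ax.plot(xs, ys)
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("explicit axes interface")
```

Both halves produce an equivalent chart. The first will feel familiar to anyone arriving from MATLAB, while the second tends to scale better once a script manages several figures at once.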

As Python has become the de facto language for data science, and is widely taught in universities, an ever-rising percentage of Matplotlib users have never used MATLAB. This frees the Matplotlib community from needing to hew to the MATLAB standard. Caswell says the benefit of MATLAB familiarity is starting to wane. He’s quick to add, however, “The Python world is not killing MATLAB. They’re also growing like crazy. We’re just growing faster.”

But what to build to stoke future growth?

“If you make a change that costs all of your users two hours, that’s a huge hit to global productivity,” Caswell says, a concern that prompts the project community to take care when introducing changes to the API. At the same time, he says you must evolve and add new features to keep up with evolving user requirements. “If you don’t keep up, you’re going to get replaced,” he says. This balance is a key tension that Caswell, like other project maintainers, must deal with on an ongoing basis.
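One common way Python libraries soften the cost of an API change is to keep the old spelling working for a while and emit a deprecation warning. The sketch below shows that general pattern using only the standard library's `warnings` module. The `draw_chart` function and its `colour` and `color` parameters are hypothetical, and this is not a description of how Matplotlib itself manages deprecations.

```python
import warnings


def draw_chart(values, *, colour=None, color=None):
    """Hypothetical plotting helper that is renaming 'colour' to 'color'."""
    if colour is not None:
        # The old keyword still works, but callers are nudged toward the new one
        warnings.warn(
            "'colour' is deprecated; use 'color' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if color is None:
            color = colour
    # Return a plain description of the chart instead of drawing anything
    return {"values": list(values), "color": color or "black"}


# Existing code keeps running; the warning tells its author what to change
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = draw_chart([1, 2, 3], colour="red")

new_style = draw_chart([1, 2, 3], color="red")
```

Users who have not yet updated their scripts lose no time up front, and maintainers still get a path to retire the old name later.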

Making Matplotlib Pay

Given this utility for so many others, I asked Caswell how much Matplotlib contributes to his work at Brookhaven National Laboratory (BNL), where he says the library is used “everywhere.” Caswell spends five to ten percent of his work time contributing upstream to Matplotlib. That percentage may go up, thanks to a $250,000 grant from the Chan Zuckerberg Initiative to help Matplotlib developers address its maintenance backlog, among other things.

Caswell may not do as much of the actual coding for Matplotlib anymore, but as project lead it’s still a significant time commitment — without getting paid for most of that work. Why does he do it?

In his graduate school years, he figured out pretty quickly that he didn’t want to be a professor. “That did not look like fun at all,” he says. Instead he discovered that he loves building tools and really wanted to build better tools for scientists. “That’s the thing that keeps me going,” he concludes. “Thinking about the grad student alone in their lab two stories underground at 11 p.m. on a Saturday. Supporting that person is what keeps me going. That’s my passion.”

There are many ways to contribute to Matplotlib, and all are welcome. If you or your organization use Matplotlib, the community would love to feature your use case on the Matplotlib blog. Visit their How to Contribute page to learn more, and check out the Matplotlib Developers’ Guide to find out how to contribute documentation, bug reports/fixes, or other code.

Feature image via Pixabay.