Python’s Developer-in-Residence Probes Pull Request Patterns

23 Nov 2021 6:00am, by

As a long-time contributor to the Python programming language, Łukasz Langa recently served as the release manager for Python versions 3.8 and 3.9. But the Poland-based developer is also the first person to hold the newly created position of developer-in-residence for CPython (the reference implementation of the Python programming language).

Established by the nonprofit Python Software Foundation, the developer-in-residence position will “assist CPython volunteer maintainers and the Steering Council,” according to a Python Foundation blog post in April.

Among the position’s duties are “analytical research to understand the project’s volunteer hours and funding.”

This is how the world ended up with a fascinating blog post offering detailed, data-driven insights into where Python comes from, with statistics on everything from the distribution of pull requests to how Python’s core developers spend their time.

In one way it’s the story of a programming language — how, through thousands of small pull requests, it continues to evolve.

But Langa’s own involvement is also unique, ultimately proving that the story behind the story is just as interesting.

A Sponsored Residency

Langa sees his residency in the larger sweep of history, remembering the days when Python was developed only by volunteers — and by Guido van Rossum, who held a paid position at Dropbox where it was understood he’d also use his time to work on the programming language.

“And many of us were in the same situation,” Langa reminisced in a late-August appearance on the “Talk Python” podcast. “I was tolerated as a CPython core developer at Facebook, and some others were at their own respective companies.

“I was super frustrated by this, because of this tremendous value that we’re giving to the entire community, including multibillion-dollar corporations.”

But now he’s fulfilling Python’s first residency in a new format — sponsored by Google, but run through the Python Software Foundation.

“That is amazing. I am very happy that they did, because I really believe that this is something that might alter how we think about maintenance of community-driven projects like Python,” Langa said, adding, “This is a kind of a game-changer. We have not done this way of sponsoring a project before, where we’re actually thinking about the ‘software’ word in ‘Python Software Foundation’ — where we directly sponsor work on the source code.”

Langa takes his responsibility seriously, as his remarks on the podcast make clear: “I do believe my particular performance kind of will make or break future ideas on whether this should be extended to more people, right?”

He laughed, then added playfully, “Or just closed down altogether! So it’s not only providing value to the project, it’s literally providing proof that this development model works. So yeah, there’s certain responsibility around it.”

Patterns in Pull Requests

Langa’s blog post emphasized the importance of transparency and visibility in the developer-in-residence position — which for him includes blogging regularly about the experience, keeping the rest of the community involved in his journey.

One of the tasks given to him, he wrote, was to search for patterns in the pull requests for libraries (as well as identifying their top contributors). So Langa began by exploring years of historical data from the python/cpython Git repository and its pull requests — converting that data into Python objects (with scripts he’s shared on GitHub) and then transforming it into a SQLite file, exploring it all with the data exploration/visualization tool Datasette.
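The pipeline described above (repository data turned into Python objects, then into a SQLite file) can be sketched roughly as follows. This is a minimal illustration, not Langa’s actual script: the `PullRequest` fields, the `pull_requests` table and the sample rows are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass
class PullRequest:
    # Hypothetical fields; the real scripts may track far more.
    number: int
    author: str
    created_at: str            # ISO 8601 date string
    merged_at: Optional[str]   # None when the PR was never merged


def build_db(prs, path=":memory:"):
    """Write pull request objects into a SQLite table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pull_requests ("
        "number INTEGER PRIMARY KEY, author TEXT, "
        "created_at TEXT, merged_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO pull_requests VALUES (?, ?, ?, ?)",
        [astuple(pr) for pr in prs],
    )
    conn.commit()
    return conn


# Made-up sample data, for illustration only.
SAMPLE = [
    PullRequest(1, "alice", "2019-09-09", "2019-09-10"),
    PullRequest(2, "bob", "2019-09-11", None),
]
conn = build_db(SAMPLE)
merged = conn.execute(
    "SELECT COUNT(*) FROM pull_requests WHERE merged_at IS NOT NULL"
).fetchone()[0]
```

Passing a file name instead of `":memory:"` would produce a database file that a tool such as Datasette can then open for browsing.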

The first interesting find? Which dates had the most merges.

“Right away you see that September 2019 was the most active recorded week in our database in terms of merges,” Langa wrote on his blog, adding, “That’s no surprise, it was the week of our annual core sprint, that year happening at Bloomberg in London.”

Langa quickly generated a bar graph, where the bars for the sprint days tower over the other bars, between two and three times taller. He called it “tangible evidence those events are worth it.”
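Finding the busiest week amounts to grouping merge dates by calendar week and counting. A small sketch of that idea, using invented dates and ISO week numbering (an assumption; the blog post’s own grouping may differ):

```python
from collections import Counter
from datetime import date


def merges_per_week(merge_dates):
    """Count merges per (ISO year, ISO week) pair."""
    counts = Counter()
    for d in merge_dates:
        year, week, _weekday = d.isocalendar()
        counts[(year, week)] += 1
    return counts


# Made-up merge dates: three in one September week, one in October.
sample = [
    date(2019, 9, 9),
    date(2019, 9, 10),
    date(2019, 9, 12),
    date(2019, 10, 1),
]
weekly = merges_per_week(sample)
busiest_week, busiest_count = weekly.most_common(1)[0]
```

Plotting `weekly` as a bar chart would give the kind of picture described above, with sprint weeks standing out from the rest.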

But soon he’d arrived at the statistics that can finally answer the question: where do Python’s core developers spend their time?

CPython consists of over 629,000 lines of Python code and more than 550,000 lines of C code. But by querying the data, Langa was able to identify the one single file that contains both the most changes since the beginning of 2019 (with 259 merged pull requests) and the most lines of code that have been changed (12,972). It’s Python/ceval.c — the 7,080-line file which actually executes the compiled code.

And to Langa’s surprise, No. 2 on the list of most merged pull requests is Python/pylifecycle.c, which holds the interpreter’s top-level routines (including init and exit), with 222 merged pull requests.

“Who would think the most change happens the deepest inside the interpreter?” he asked.

Langa was also able to tease out information on who’s making pull requests. The No. 1 most frequent contributor is a GitHub user named miss-islington, a bot that automatically checks merged pull requests for any issues with backporting. (In keeping with Python’s roots, the bot was named after the character in “Monty Python and the Holy Grail” who must insist to a mob that she is not a witch.)

“Clearly, it pays to be a bot (like miss-islington, web-flow, or blurb-it),” writes Langa, “or a release manager since this naturally causes you to make a lot of commits.”

The top two human contributors are Victor Stinner (paid by Red Hat to maintain Python upstream) and Serhiy Storchaka (a Ukraine-based Python core developer), both of whom Langa acknowledges for “amazing amounts of activity.” Stinner comes in at No. 2 with 3,775 merged pull requests, while Storchaka has 2,582.

Langa then tried writing a script identifying the top five contributors for each file, though after breaking it down into 636 categories, he still discovered that for 618 of those categories, two of the top five contributors were … Stinner and Storchaka, again. “In fact, some files are missing contributors entirely save for our two top giants,” Langa wrote.
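A per-file ranking like this can be approximated by counting authors for each path and keeping the most frequent ones. The sketch below is an assumption about the general approach, not the script Langa wrote; the file paths and contributor names are placeholders.

```python
from collections import Counter, defaultdict


def top_contributors(changes, n=5):
    """Map each file path to its n most frequent authors.

    `changes` is an iterable of (path, author) pairs, one per merged change.
    """
    per_file = defaultdict(Counter)
    for path, author in changes:
        per_file[path][author] += 1
    return {
        path: [author for author, _count in counter.most_common(n)]
        for path, counter in per_file.items()
    }


# Placeholder data: names and paths are invented for the example.
changes = [
    ("Lib/example_a.py", "dev_one"),
    ("Lib/example_a.py", "dev_one"),
    ("Lib/example_a.py", "dev_one"),
    ("Lib/example_a.py", "dev_two"),
    ("Lib/example_a.py", "dev_two"),
    ("Lib/example_a.py", "dev_three"),
    ("Lib/example_b.py", "dev_two"),
]
ranking = top_contributors(changes, n=2)
```

With real data, the same dictionary would show which names recur across many files and which appear only in one corner of the codebase.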

But as reassuring as it was to find them watching over the project, Langa was also able to identify some “experts” who were “laser-focusing” on specific parts of Python (like its handling of email or code for handling types). This was information specifically requested by the Python Software Foundation, so it may prove useful as Python development continues in the future.

Drive-Bys and ‘Transformational Potential’

In a July blog post, Langa even argued that his role had “transformational potential” for Python. “In short, I believe the mission of the developer-in-residence is to accelerate the developer experience of everybody else. This includes not only the core development team, but most importantly the drive-by contributors submitting pull requests and creating issues on the tracker.”

Langa elaborated on the importance of casual contributors on the “Talk Python” podcast. “There’s a lot of us on the core team, and even more people around the core team who are kind of — well, we call them drive-by contributors,” he said.

“They would find an issue, produce a bunch of pull requests, and maybe then kind of disappear … Obviously every year this changes, how Python is developed. We’re going to have a bunch of people who are super invested, and they’re going to be spending crazy amounts of time, including on weekends and whatnot, to work on Python, even for free. I know — I did that for a decade.

“So those contributions are super valued. But usually, those really don’t — well, those people change, right? Like, you can’t really do this in a consistent manner, day-in, day-out, for a long period of time. Your life situation changes, your job changes or whatnot, and, you know, you stop contributing. And what happens to Python then? Well, we lose some value.”

Average Wait Times for Pull Request Merges

Langa thinks one way of encouraging more contributions is to merge more of the pull requests that have been submitted. “Currently we have over 1,400 open pull requests. And I’ve been on a mission to kind of bring that number down,” he wrote. “Currently as I’m looking at it, it’s 1,421.”

So there was another crucial question Langa explored in his Python-related data science: how long does it take to merge a pull request?

His first results showed the average wait time is 14.6 days, Langa wrote. (Although, “obviously, the answer in a big project is ‘it depends.’ Averages lie.”)

More research revealed that, especially for non-core developers, the deviations from the average can be wildly large: Langa’s calculations place their standard deviation at 81.7 days. For core developers, the average wait time is nearly 9.5 days, but their standard deviation is also high. (It’s just under 42 days, but it jumps to 77.4 days if the core developers aren’t merging their own pull requests.)

Meanwhile, pull requests that don’t get merged — but are closed instead — wait, on average, more than 105 days, Langa wrote. “But as I said, averages lie.”
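Why “averages lie” is easy to see with a toy calculation. The sketch below uses invented wait times and Python’s sample standard deviation (an assumption; the blog post does not need to have used this exact estimator): a single long-waiting pull request drags the mean far above the typical case.

```python
from datetime import datetime
from statistics import mean, median, stdev


def wait_days(created, merged):
    """Days elapsed between a PR being opened and being merged."""
    return (merged - created).total_seconds() / 86400


def summarize(waits):
    """Mean, median and sample standard deviation of wait times in days."""
    return {"mean": mean(waits), "median": median(waits), "stdev": stdev(waits)}


# Invented wait times: four quick merges and one that sat for 90 days.
waits = [1.0, 1.0, 2.0, 3.0, 90.0]
summary = summarize(waits)

# Example of deriving one wait time from two timestamps.
one_wait = wait_days(datetime(2021, 1, 1), datetime(2021, 1, 15, 12))
```

Here the mean lands far above the median, and the standard deviation is larger than the mean itself, the same shape the real figures show.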

The next steps are still to be determined. But, for now, there are fresh data-driven perspectives on the current state of Python development.

It’s all part of how Python’s first developer-in-residence is keeping the community involved in his journey — and he’s now even inviting them to help choose what he should investigate next.

“If you have any suggestions on things I could look at,” Langa’s blog post concluded, “let me know!”