Navigating the Messy World of Open Source Contributor Data
Open source communities are inherently open and distributed, which makes open source software data inevitably messy. And when someone’s privacy — especially someone who has been contributing to your project for free — is at risk, messy is not OK. At this year’s KubeCon+CloudNativeCon North America, Sophia Vargas, research analyst and open source program manager at Google’s open source office, talked about how best to navigate the regulations and ethics of data management in an open source community.
Or, as she called it, “Things I wish I knew before I started this job.”
Define Your Project Contributions to Define Your Data
Part of the pain is that there is no single definition of an open source project. Yes, an open source project is defined by the open source license it holds, but from there, it could be code, datasets, a framework, metrics, a common language or a design. A license can apply to all different sorts of assets, each with its own assortment of data.
A lot of the data collected within an open source community centers on contributions: who is doing what, who to acknowledge, who needs more encouragement, how to attract more users, and how to embrace diversity, equity and inclusion, along with other goals that traditional business models may not prioritize.
Since an open source model isn’t necessarily a business model, it demands whole new metrics that are community-based.
Vargas argues that a community contribution focus should determine your data sources. However, according to research published earlier this year by the University of Vermont, most open source projects are measured by “idiosyncratic methods” that are notoriously hard to mine; GitHub’s top contributors, for example, are measured solely by actions taken on the platform, like commits, pushes and patches. But there are many other kinds of contribution. The University of Vermont’s All Contributors Model acknowledges more diverse contributions, including outreach, finance, infrastructure and community management. These groupings include:
- Data Source #1: Git, issues — the most often measured contributions like writing and reviewing code, bug triaging, quality assurance and testing, and any security-related activities.
- Data Source #2: Surveys — this is useful to measure community engagement, event organization, demographics, teaching and tutorial development.
- Data Source #3: Forums — documentation writing and editing, localization and translation, and troubleshooting and support.
- Data Source #4: Site Analytics — marketing, public relations, social media management and content creation.
- Data Source #5: Legal and finance — as projects scale, there are also significant contributions from legal counsel and financial management that are often measured in more traditional business ways.
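The groupings above suggest a simple way to tally contributions beyond commits alone. Here is a minimal sketch; the event-type names and the `tally_by_source` helper are illustrative assumptions, not anything from the talk:

```python
from collections import Counter

# Hypothetical mapping from raw event types to the five data-source
# groupings above (names are illustrative).
SOURCE_OF = {
    "commit": "git_issues", "review": "git_issues", "triage": "git_issues",
    "survey_response": "surveys", "event_organized": "surveys",
    "doc_edit": "forums", "translation": "forums", "support_reply": "forums",
    "blog_post": "site_analytics", "social_post": "site_analytics",
    "legal_review": "legal_finance", "budget": "legal_finance",
}

def tally_by_source(events):
    """Count contributions per data-source grouping.

    `events` is an iterable of (contributor, event_type) pairs.
    Unknown event types land in an "other" bucket instead of being
    silently dropped, so non-code work stays visible."""
    counts = Counter()
    for _contributor, event_type in events:
        counts[SOURCE_OF.get(event_type, "other")] += 1
    return counts

events = [("a", "commit"), ("b", "doc_edit"), ("a", "review"), ("c", "talk")]
print(tally_by_source(events))
```

The point of the explicit “other” bucket is the same as the research’s: anything you don’t classify tends to disappear from acknowledgment entirely.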
This research found that more community-generated systems of contribution acknowledgment are necessary, not only to cultivate a culture of collaboration but to increase bug visibility and overall innovation.
Now, What Are You Doing with All That PII?
To start, you have to acknowledge you’re only looking at a subset of the information, and most of that is just in your repository. Vargas advocates investigating alternative trajectories, including acknowledging that public data sometimes deserves classification. It can certainly be valuable to gather information ranging from demographics, in an effort to improve diversity, to knowing who is actively promoting your open source project on social media.
This public data most often comes from social media, but could come from anywhere, including what you might say speaking at an event or meetup. It could also be anecdotal data that a contributor shares with a maintainer in a chat. Vargas noted that the GitHub push event payload includes an author name in order to verify the person who made the original commit.
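As one illustration of how PII rides along with ordinary metrics data: each commit entry in a push payload carries an author name and email. A minimal sketch of scrubbing those fields before the payload is stored for counting (the payload here is abbreviated and hypothetical apart from the `ref`, `commits` and `author` keys):

```python
# Abbreviated, hypothetical push-event payload; real payloads carry
# many more keys.
payload = {
    "ref": "refs/heads/main",
    "commits": [
        {"id": "abc123", "message": "Fix typo",
         "author": {"name": "Jane Doe", "email": "jane@example.com"}},
    ],
}

def strip_author_pii(push_payload):
    """Return a copy of the payload with commit author details removed,
    keeping only what is needed to count activity."""
    cleaned = {"ref": push_payload["ref"], "commits": []}
    for commit in push_payload["commits"]:
        cleaned["commits"].append(
            {"id": commit["id"], "message": commit["message"]})
    return cleaned

print(strip_author_pii(payload))
```

Dropping the fields at ingestion time, rather than at analysis time, means the raw identities never enter the dataset that gets shared around.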
But, if it’s public data, what’s the problem? When you are combining datasets, you are risking personally identifiable information or PII. Risk classifications are dependent on the parties involved, where regulations, organizations and projects intersect.
“We encourage a right to anonymity, but, at the same time, the more you learn about someone on GitHub or on other channels, the more you can infer about them,” Vargas said.
Using herself as an example: she had introduced herself at the start of the talk as working at Google, and shared her Twitter handle, which says she lives in New York. Her LinkedIn profile says when and where she went to college. Cross that with any place she has published or commits she has made on open source projects, and it aggregates into a lot of information, more than an open source maintainer would typically need. Still, if an open source project is looking to increase diversity, equity, inclusion and representation, as it should be, it needs much of that demographic data. PII can also help you decide who to promote or give awards to within your community.
There’s also an argument for organizations that are contributing to open source to understand how their employees are interacting with it. Are you providing the right tooling to support them? Is the documentation sufficient? Of course, this could backfire if an organization decides to promote or compensate based on total number of commits, only again acknowledging the most technical contributions.
And we already know there are many reasons to want to separate your identity from your contributions, ranging from safety to a higher chance of success. It’s widely cited that, when women hide their gender on GitHub, their pull requests are accepted at a higher rate than their male counterparts’, meaning women’s code performs statistically better, yet their open source contributions are more likely to be dismissed when their gender is visible.
There’s no clear answer to any of this, but when your contributors’ privacy is at stake, maintainers should constantly consider how much data they need, how they are getting it, how they are sharing it, and especially who could be harmed by that data retention and sharing.
Different Ways of Managing and Measuring Open Source Data
Another challenge is that these open source datasets are as innately distributed as the systems that generate them. Vargas knows of about 300 repositories within Kubernetes alone, though maybe not all of them are active.
With all that interconnectivity, it’s also hard to understand where one project stops and another begins, with one perhaps forked into another or a new one added along the way. Git activity logs do not stay consistent over time. Historical information can change. Somebody could have made a mistake or forgotten to count something.
“In the context of git, if the project has a lot of forks, [and] that fork gets merged back in, that fork becomes part of the history of the project, [and then] your historical and current state don’t necessarily match.” Vargas also warned that GitHub APIs have rate limits, which means you often end up with lost or missing data. There can also be cleaning and correction over time, which will further alter the data.
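The rate-limit problem is worth handling explicitly when you mine the API: GitHub reports your remaining quota in the `X-RateLimit-Remaining` response header and the reset time (as Unix epoch seconds) in `X-RateLimit-Reset`. A minimal sketch of a backoff helper built on those headers (the `seconds_to_wait` function itself is illustrative):

```python
import time

def seconds_to_wait(headers, now=None):
    """Given GitHub API response headers, return how long to pause
    before the next request, so requests are not silently dropped
    and data lost once the quota runs out."""
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0  # quota left: no need to wait
    reset = int(headers.get("X-RateLimit-Reset", str(int(now))))
    return max(0, reset - now)

# Quota exhausted, reset 30 seconds from "now":
print(seconds_to_wait({"X-RateLimit-Remaining": "0",
                       "X-RateLimit-Reset": "1030"}, now=1000))
# 30
```

Pausing until the reset, rather than letting requests fail, is one way to keep the “missing data” Vargas describes out of your historical record.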
“There’s a little bit of a tribal knowledge problem if you don’t realize the perimeters of the project,” she said.
With so much data, it can be attractive to just lump it all together to see what comes up.
“Huge tables with everything in the same place is much easier to work with, but when we are talking about this information, it’s often better to give back to the people it’s about. But when you share back with your community, it increases the risk levels,” Vargas said.
For many of these situations, is the actual data even necessary? Could you leverage anonymized or synthetic data instead?
Whenever possible, Vargas recommends tokenizing and compartmentalizing data, and allowing only individual access that can be revoked. Big tables shared around will always be higher risk.
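Tokenization here can be as simple as replacing raw identities with keyed hashes before data leaves the compartment. A minimal sketch, assuming a secret key stored separately from the dataset (the `tokenize` helper and key name are hypothetical):

```python
import hmac
import hashlib

SECRET = b"rotate-me"  # hypothetical key, kept apart from the dataset

def tokenize(identity: str) -> str:
    """Replace a raw identity (email, handle) with a stable keyed token.

    The same input always maps to the same token, so records can still
    be joined across tables, but the raw identity never appears in the
    shared data and cannot be recovered without the key."""
    return hmac.new(SECRET, identity.encode(), hashlib.sha256).hexdigest()[:16]

rows = [("jane@example.com", "commit"), ("jane@example.com", "review")]
shared = [(tokenize(email), event) for email, event in rows]
print(shared)
```

Because the token is keyed (HMAC) rather than a plain hash, an outsider cannot simply hash a list of known emails to re-identify contributors; rotating the key also invalidates old joins, which is the compartmentalization Vargas is after.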
Also, be cognizant that the method by which you collect data can itself affect that data. For example, if you contact someone directly, as in a survey, they may not be as candid. Demographics, opinions, feedback, satisfaction and morale are best collected anonymously. And of course provide opt-outs, so people do not have to answer every question if they do not want to.
Vargas recommends always being transparent about why you’re collecting data, like you want to encourage more participation or representation.
“If your community is very tiny, you may be able to figure out who they are. Decide if things are appropriate to share back,” Vargas warned. Even anonymized, it can be clear who the source is. We’ve already written about instances where a DEI survey respondent was concerned about being identifiable as the only woman in her country working on a project.
You can also draw conclusions from indirect data, often aggregated or inferred. Just be sure to exclude the 120 million bots Vargas estimates are on GitHub alone.
Be Careful What You Share, Too
Remember, when you measure, the act of measuring influences the result.
Open source leaderboards are a clear motivator, but be aware of what puts people on top. GitHub leaderboards often just tout the number of commits, which can encourage a flood of smaller, less significant contributions and celebrates only technical work. Be sure to acknowledge more qualitative contributions as well.
Focus on what behaviors and contributions you want to encourage and see more of in your community. Vargas offered some logical things to measure, depending on what outcome or outcomes you hope to achieve:
- Project growth — activity and contributions per repo per time
- Contributor growth — new versus retained contributors over time; activity per contributor over time
- Contributor diversity — the percentage of contribution from outside of your organization
- Contribution experience — time to respond or close pull requests
- Contribution quality — the percentage of pull requests closed and merged
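Two of these metrics can be sketched from a simple pull-request table. A minimal, hedged example; the record fields and function names are illustrative, not from the talk:

```python
def contribution_quality(prs):
    """Percentage of pull requests that were merged."""
    merged = sum(1 for pr in prs if pr["merged"])
    return 100.0 * merged / len(prs)

def contributor_diversity(prs, host_org):
    """Percentage of contributions from outside the host organization."""
    external = sum(1 for pr in prs if pr["org"] != host_org)
    return 100.0 * external / len(prs)

# Hypothetical PR records:
prs = [
    {"author": "a", "org": "acme", "merged": True},
    {"author": "b", "org": "acme", "merged": False},
    {"author": "c", "org": "other", "merged": True},
    {"author": "d", "org": "other", "merged": True},
]
print(contribution_quality(prs))           # 75.0
print(contributor_diversity(prs, "acme"))  # 50.0
```

Note that both are ratios over a window of PRs, which is why Vargas’s advice to state the time period alongside the results matters: the same project can look very different over one month versus one year.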
Vargas recommends keeping contributions recent, at least within the last year, and sharing that time period along with the results. And note anything external that could affect results, like the challenges of contributing during a pandemic.
Transparency matters. Always state what you are counting. Define what you are counting as a “contribution” or “engagement.”
And remember, a good data scientist states her sources. What methods, assumptions, biases and boundaries influenced your data collection and processing? What types of contributors does this represent? Vargas shared an example of how she handles an “About this Data” section when sharing data.
Own Your Accountability
In the end, Vargas says you just have to assume you are accountable. She recommends that you:
- Know your licenses and regulations — and of those that you’re integrating with and tools you’re using
- Openly communicate your purpose and intent
- Design policies that foster trust, transparency and safety
- Provide an option to opt in and out
- Ask if you really even need to use personally identifiable information
- Always read the fine print to understand your own rights as a contributor
Also, while this article focuses heavily on GitHub, the world’s biggest open source repository host, don’t assume it’s your one and only source of data. Go where your community is.
Just remember, “Every time you add more PII, it adds more risk and potential vulnerability to your dataset.”