Can Masking Metadata Boost Open Source Security, Diversity?

Whether you call it a Trojan horse or a malicious code injection, there is more risk than ever of hackers not only gaining access to your code, but persuading you that they are a known, valued contributor.
In fact, more than half of breach incidents involve a malicious outsider. And 2021 saw a 650% increase in cyberattacks on open source suppliers.
This risk multiplies in large-scale open source communities that have not only thousands of contributors, but dependencies on other open source projects. Since 99% of stacks have at least some open source components, according to an April report from Synopsys, this isn’t just an open source problem. It’s a global tech one.
Fortunately, there is a simple technical solution to this security problem: mask your metadata before considering pull requests.
Sal Kimmich, open source developer advocate at Sonatype, told The New Stack about a new movement to use this practice to not only improve code quality and security, but improve diversity in the open source world, too.
Kimmich will be presenting this solution to real-world problems at Open Source Summit Europe, in Dublin, on Sept. 15.
Confirmation Bias Puts Your Code at Risk
GitHub is by far the most popular open source code repository; the lessons learned from masking metadata on it could be applied to other repositories, although the technical pathway may differ.
Currently, when a contributor to a project makes a pull request to suggest an alteration to the code or ReadMe documentation, a maintainer reviews the quality of the code and decides whether or not to approve it.
“This is a pretty straightforward process of seeing if it makes sense against the feature map, if it’s something that you actually do want to include,” Kimmich said. “And then, secondarily, is the code high enough quality to be able to effectively merge, to not break the system when it enters?”
Viability or “merge-ability” of code is mostly automated. But GitHub has an added social value indicator via identifying metadata — profile photo, name and handle — which, Kimmich said, is “providing additional information at a decision point, which does not need to be there.”
If you’re maintaining an open source project, it’s natural to have your favorites, your most trusted contributors. That is even an indicator of a successful open source community. However, that favoritism can create a halo effect or signal processing problem, which, at best, ignores potentially better contributions, and, at worst, puts your project — and all the other projects integrating with it — at risk for malicious attacks.
“Open source is a deeply human activity. We just want to engineer it to reduce the bias that comes with that.”
– Sal Kimmich, open source developer advocate, Sonatype
“These are called malicious injections to open source projects, where they will typically aim to get a hold of the email and GitHub handle of either a known positive contributor, or — if they can get it — a known maintainer, and just take over the project through that,” Kimmich said.
This happens because confirmation bias makes us go easier on our trusted collaborators, and maybe we don’t check the code of known, valued contributors or maintainers as thoroughly, according to Kimmich. Maintainers can get into the habit of verifying at this social indicator level. If it merges without breaking, that’s enough, right?
Not really. This is interfering with your signal process for finding the best code, they said. And how are you so sure it’s really one of your favorite contributors anyway?
“You still want to be having high enough scrutiny to ask the question: If it’s a known contributor, is this really them?” Kimmich said. “And right now, without blinding those commits, we’re leaving the open source community wildly open to these social attack vectors.”
Coding is a language that’s applied creatively, and thus the commits from individual users of GitHub and its ilk include naturally identifying characteristics, comments and emoji usage. Kimmich observed that hackers are becoming more sophisticated, and they will even mimic those language choices when they’re trying to steal another user’s identity.
And, typically, dangerous users aren’t submitting obviously malicious code. They are often solving a real problem for a project — but, somewhere in that good code, they are including malicious code that isn’t always easily discoverable.
The emerging security best practice of masking metadata prevents this preferential treatment and reduces the likelihood of successful malicious code injection.
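Purely as an illustration of what masking might look like in practice, here is a minimal sketch that strips identifying author metadata from a pull request payload before it reaches a reviewer. The function name, field names and salt are hypothetical; the dictionary shape loosely mirrors GitHub's pull_request webhook JSON, but nothing here is an actual GitHub feature.

```python
import hashlib

# Hypothetical field names, loosely mirroring a GitHub pull_request
# webhook payload; this is a sketch, not a real GitHub capability.
IDENTIFYING_FIELDS = ("login", "avatar_url", "email", "name")

def mask_pull_request(payload: dict, salt: str = "review-session") -> dict:
    """Return a copy of a PR payload with identifying author metadata
    replaced, so the reviewer judges the contribution on code alone."""
    masked = dict(payload)
    author = dict(masked.get("user", {}))
    # Derive a stable pseudonym so a discussion thread stays coherent
    # across comments without revealing who the contributor is.
    pseudonym = "contributor-" + hashlib.sha256(
        (salt + author.get("login", "")).encode()
    ).hexdigest()[:8]
    for field in IDENTIFYING_FIELDS:
        if field in author:
            author[field] = pseudonym if field == "login" else ""
    masked["user"] = author
    return masked
```

A maintainer's tooling could apply this at intake, so the review queue shows only `contributor-ab12cd34` and the diff itself, removing the social signal from the decision point Kimmich describes.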
Masked Pull Requests Can Increase Diversity
Masking identifying metadata at the start of the pull request review process should significantly decrease one of the risk factors for malicious injections. Research also suggests it can increase the quality of approved pull requests and improve diversity, equity and inclusion — an area where the open source community lags even the tech industry as a whole.
A well-cited study from 2017 examined open source GitHub pull requests by almost 3 million users. It observed that, when their profiles obscure their gender, women’s pull requests have a rate of acceptance that’s 4% higher than that of men. But when their gender is apparent, they drop nine points below men’s acceptance rate.
This study indicated that women’s code was of higher quality, with less need to refactor before merging, but predominantly male maintainers were leaning on affinity bias to cloud their judgment.
With this study in mind, when mentoring more than 50 women-identifying open source newcomers, Kimmich has repeatedly advised their mentees to keep their GitHub handles gender-neutral. But for Kimmich, masking metadata is a sensible solution for quality and security before it’s even about equality.
Looking back on the 2017 study, Kimmich noted, “In that case, we were removing the probability of receiving the highest quality commits to open source. So this, to me, is not even a gender issue. This is: Are we actually getting the highest quality information?
“And if we’re not receiving the highest quality information when it is provided, is there something about the system which is skewing that?” they asked, pointing again to the superfluous metadata contributing to a halo effect.
This study backs up the idea that code quality improves if gender-identifying characteristics are masked at the pull request level. Furthermore, it would help prevent unfair treatment and bias that becomes one of several deterrents for women and other minoritized genders to contribute to open source.
While 85% of men surveyed by the Linux Foundation in 2021 said they agree with the statement “I feel welcome in open source,” only 74% of women surveyed said the same. For women located in North America, only 65% said they felt welcome in open source. For respondents worldwide who identified as nonbinary, third gender or other, the figure was 62%.
Building a Community Despite Masked Metadata
Masking the identifying metadata of code contributions has a time and a place, Kimmich said. It’s not likely appropriate for private or enterprise projects — most organizations want to see their employees’ work.
It would also be rather futile in a smaller open source community of only a few hundred people because of identifying coding styles. Masking metadata is well-suited for highly distributed communities with thousands of contributors, like Kubernetes and Prometheus, which can have maintainers and projects distributed around the world.
“When we get that distributed, that’s the kind of project that they’re putting malicious injections into,” Kimmich said. “That’s the kind of project where you’re receiving so many PRs in a day that you’re likely to be leaning on your internal biases, [rather] than your ability to verify the code. On those projects, this kind of best practice should be in place.”
These are also the open source projects with the most global impact when they have a security breach.
Blanket application of metadata masking would also put at risk the community part of an open source community. You definitely want to be able to welcome new contributors and shower praise and gratitude on repeat ones.
In fact, the desire to be part of a community drives most open source contributions. A recent paper submitted to the IEEE International Conference on Software Engineering found the following motivations for why folks contribute to open source projects:
- 91% for fun
- 89% enjoy helping others
- 85% to give back
- 80% for kinship
- 68% seek reputation
It’s clear that community and recognition are essential to open source sustainability.
Open source also has a strong mentoring arm. Maintainers need to know who is contributing in order to provide constructive feedback, recruit new active contributors and upskill new maintainers for essential projects. It’s also important to track demographics to understand and measure the success, or lack thereof, of any diversity and inclusion efforts.
For these many reasons, Kimmich said, they are not trying to make masked metadata a mandatory feature within GitHub, but rather one that can be turned on and off.
They advocate that it be turned on when the initial pull request is received. As soon as the maintainer chooses to accept/reject or engage with that code on the quality of the contribution alone, the metadata should be unmasked. This allows that maintainer to then engage at a social level with their contributor community and to provide more personalized feedback.
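The flow Kimmich describes, masked at receipt and unmasked only once the maintainer has judged the contribution on quality alone, could be sketched as a tiny review object. The class and method names here are illustrative, not an existing GitHub feature:

```python
class MaskedReview:
    """Illustrative sketch of the proposed flow: metadata stays hidden
    until the maintainer records a decision on the code alone.
    (Hypothetical API, not a real GitHub capability.)"""

    def __init__(self, author: str, code_diff: str):
        self._author = author        # hidden until a decision is made
        self.code_diff = code_diff   # always visible to the reviewer
        self.decision = None

    @property
    def author(self) -> str:
        # Before a decision, the reviewer sees only an opaque label.
        return self._author if self.decision else "masked-contributor"

    def decide(self, verdict: str) -> str:
        """Record accept/reject, then unmask so the maintainer can
        follow up with personalized, social feedback."""
        if verdict not in ("accept", "reject"):
            raise ValueError("verdict must be 'accept' or 'reject'")
        self.decision = verdict
        return self._author
```

For example, `MaskedReview("octocat", diff).author` would read `masked-contributor` during review, and only after `decide("accept")` would the maintainer see `octocat` and be able to thank them personally.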
Furthermore, Kimmich hopes that metadata masking will be enshrined in the Cloud Native Computing Foundation’s security best practices, as it “improves the information processing around malicious injection [and] social bias, and we contribute a good best practice to the open source system.”
Update: Since publishing this article, we had a chance to talk to Demetris Cheatham, GitHub’s senior director of diversity, inclusion and belonging strategy. While she called the security angle of masking metadata intriguing, she responded that, for diversity and inclusion purposes, she is “personally not a fan of blind pull requests or resumes. I want you to see me. To see me and appreciate the experiences and perspectives that I bring that might be different for others.”
Open source is not just about the code for Cheatham, but the people and perspectives building it. Knowing who a pull request comes from, she suggested, lets a maintainer appreciate the perspective that contributor brings that others may not. Her experience as a first-generation college student who grew up in the poorest part of North Carolina and graduated from an HBCU is as important as her professional accomplishments. “More diverse teams have more innovation and creativity,” in part, she pointed out, because different perspectives shape the product, such as a person holding back a release to fix something.
This is why Cheatham’s team is building nuance into GitHub’s All In internship program. When the world suddenly went online as the pandemic kicked off, campuses, libraries and McDonald’s were closed, which blocked access to coveted tech internships for those who didn’t have internet access. She said, “a lot of students were left behind. Those students couldn’t have internships and now aren’t getting jobs.”
“I’d actually like to start highlighting these differences,” instead of obscuring them, she said. Solely considering the DEI potential of metadata masking, Cheatham said, “We have to make sure the solution isn’t a band-aid that’s masking things that aren’t being addressed, [like] usually equity issues.” There is no easy fix for DEI concerns — the work has to be put in.