GitHub Applies Machine Learning to Security Alerts for Your Project Dependencies
GitHub, the world’s leading shared code repository, is home to literally millions of open source software projects, from tiny single-function packages to programmatic pillars of the tech world like, oh, React, or this little thing called Linux. Meaning that, if you build software, your code almost certainly depends on at least one of those handy GitHub-hosted open source projects. Which, as we now know all too well following last year’s left-pad disaster, can lead to unintended consequences.
Feeling vulnerable yet?
Well aware of the need to manage the ever-increasing complexity of project dependencies, as well as keep code safer within the interconnected open source ecosystem, GitHub’s data and analytics team has stepped up with two new features targeted at increasing security and creating transparency in the murky waters of project dependencies.
Co-Dependencies Not So Anonymous
The first stage of the effort, dependency graphing, was announced last month at GitHub Universe and is now live on the platform. Located under the “insights” tab on any GitHub repository, the dependency graph shows all the packages and applications your project is connected to, as well as all the projects that, in turn, depend on your code. And all without ever leaving your repo.
Power Up the Graph
This week, GitHub also launched the next step of its security initiative: an alert system that proactively notifies you whenever one of your project’s dependencies is associated with a known public security vulnerability. Public repos will automatically have security alerts enabled via their dependency graphs, but private repos need to opt in. By default, admins will be the first responders for security alerts, but anyone with repo access, from individuals to entire teams, can be added as alert recipients under repo settings.
When an alert is triggered for a potential vulnerability, the notification will highlight any dependencies affected. The most advanced feature of the new security alert system uses machine learning to include recommendations for replacement with known safe versions from the GitHub community if any exist.
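To make the alerting idea concrete, here is a minimal Python sketch of matching a project's dependencies against a list of known advisories and suggesting a safe version. It is an illustration only, not GitHub's implementation: the package names, the `Advisory` structure, and the `check_dependencies` function are all hypothetical, and a plain version comparison stands in for the machine learning the article describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """Hypothetical advisory record: versions below `fixed_in` are affected."""
    package: str
    fixed_in: tuple
    summary: str


def parse_version(text):
    """Turn a dotted version string such as '2.1.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_dependencies(dependencies, advisories):
    """Return one alert for each dependency affected by a known advisory."""
    alerts = []
    for name, version in dependencies.items():
        current = parse_version(version)
        for advisory in advisories:
            if advisory.package == name and current < advisory.fixed_in:
                alerts.append({
                    "package": name,
                    "installed_version": version,
                    # Suggest the first version known to contain the fix.
                    "suggested_version": ".".join(map(str, advisory.fixed_in)),
                    "summary": advisory.summary,
                })
    return alerts


# Made-up project manifest and advisory feed, for illustration only.
dependencies = {"example-parser": "2.1.0", "example-http": "1.4.2"}
advisories = [
    Advisory("example-parser", (2, 3, 1), "Crafted input can crash the parser"),
    Advisory("example-http", (1, 2, 0), "Header injection in redirects"),
]

alerts = check_dependencies(dependencies, advisories)
for alert in alerts:
    print(f"{alert['package']} {alert['installed_version']}: {alert['summary']} "
          f"(suggested: {alert['suggested_version']})")
```

In this sketch only `example-parser` triggers an alert, because the installed `example-http` is already newer than the version that fixed its advisory.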
Miju Han, engineering manager for data science and analytics at GitHub, spoke with The New Stack about how the new features came to be and where the company hopes to take them. She called them “a big deal for both our customers and our users.”
“I won’t say that the Equifax breach was a catalyst, it popped up while we were already working on this, but it really showcased how vulnerabilities multiply when you have dependencies,” Han said. “It’s a problem for everybody. So we wanted to figure out how to manage this and hopefully prevent the next Equifax — because it is inevitable.”
The security risks from dependencies are growing steadily, she noted. Many projects easily have more than 100 dependencies, which is a lot for even a team of developers to keep track of. “The Equifax CEO accused an engineer of not doing his job and that is why the leak happened, but when you have a system that depends on one person and a messy feed, that is not a secure system,” she said.
With dependency graphing, vulnerabilities are tracked automatically, and information is available by way of an API. The service also uses machine learning to suggest fixes for the vulnerable software.
“Today’s launch only starts this journey, but we fully intend to keep walking down the path toward the dream of a fully self-healing system, where we can not only identify vulnerabilities and suggest remedies but seamlessly take care of them through the intelligent application of data,” she said.
And this is just the start of GitHub’s use of machine learning. Han foresees a time when machine learning can assist in even more complicated tasks, perhaps even one day assisting in the writing of the code itself. Han admits that this dream of “self-completing code” is a moonshot, a feature none of us may see in our lifetimes. But the foundation is there.
“With the access to the rich world of data we have via GitHub, we are enriching data, parsing it, then annotating it and chopping it apart. Increasingly abstracting information on code blocks, not just individual lines. Next comes extracting more data on how those blocks perform, so we have a world where code is easier to understand and to onboard new people to work on it — which will render it more performant, and more secure,” she said.
“With this kind of data application and automation, we can focus not just on software, or lines of code, but on ideas — and creating the future of software development.”
Feature image by Janko Ferlič via Unsplash.