Data / Development

How Code Analytics Could Help GitHub Decipher Its Semantic Code Graph of Open Source

2 Dec 2019 12:17pm

By 2025, there will be over 100 million developers on GitHub, CEO Nat Friedman predicted at the GitHub Universe conference recently, noting that 99% of software projects already include open source. There were 87 million pull requests on GitHub last year. 7.6 million security alerts have been fixed by developers and maintainers using more than 200,000 automated updates generated by the Dependabot tool.

If developers are going to cope with that sort of scale without burning out and keep open source the success story it’s become, GitHub needs to create systems and processes to help them. And a deep understanding of the code that lives on GitHub can help make those smarter and more useful.

The new GitHub mobile app, smarter task assignment and the improved search and intelligent code navigation features should all help with developer productivity. Being able to jump from one reference to the next, or find the line of code you’re really looking for rather than being presented with every comment that happens to contain that sequence of letters, will make it faster and easier to navigate a large codebase. Similarly, making the CodeQL semantic code analysis engine (which GitHub acquired when it bought Semmle in September) free for research and open source development will help security researchers find new CVEs and developers automate security checks of their codebases for common patterns like buffer overflows and cross-site scripting.

But to make these new features work, GitHub is doing more than creating simple indexes and dependency graphs. What it’s generating behind the scenes is more like a semantic code graph of all the public repos on GitHub, and that offers enormous opportunities to understand and improve coding patterns, quality and security: from sending only relevant notifications and picking the right developer to review a pull request, to documenting code or helping you find the right modules to refactor.

Navigating through code right in the browser. Image: GitHub

The Meaning of Code

Code search and code navigation both rely on the open source Semantic library for parsing, analyzing and comparing source code in a range of programming languages. Code search handles case sensitivity, tokenization and special characters, as well as letting you choose to match an exact string or an entire word. Instead of dozens of results, it finds far fewer and more relevant results.

Initially, code navigation works in Ruby, Python or Go repos, letting you jump to code definitions and find all references in the repo that call a function or method right in the browser without needing to open the code in an IDE. GitHub open sourced the library so more language communities would build bindings for it: it also supports other languages, including JavaScript, TypeScript and Java, and code navigation will work with more of those languages soon.

“Syntax varies across languages but the implicit language constructs are fairly consistent because most of these languages are object oriented,” GitHub vice president of strategy Kelly Stirman explained to us. That’s how GitHub can know what’s a function or a method and extract the code that defines it and the places where it’s called, which form a graph representing the structure of the code.

“The Semantic library lets us develop this core graph concept, independent of the underlying language being used. That’s all about our understanding of the graph of the code within a particular project; there’s a different graph around the dependencies of the code within a project, and it’s all part of one master graph. If I use a certain library or a package, depending on which language you’re using those relationships are kind of like a Russian Doll, and across all those interrelationships [we can look at] things like licenses or using particular versions of code.”
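The “Russian Doll” of nested dependencies Stirman describes amounts to a graph walk: start at a project, follow its dependencies transitively, and accumulate facts like licenses along the way. This is a minimal sketch with a made-up package table, not GitHub’s actual dependency graph:

```python
from collections import deque

# Hypothetical package metadata: each package lists its license and direct deps.
PACKAGES = {
    "my-app":        {"license": "MIT",        "deps": ["web-framework", "json-utils"]},
    "web-framework": {"license": "Apache-2.0", "deps": ["json-utils", "logger"]},
    "json-utils":    {"license": "MIT",        "deps": []},
    "logger":        {"license": "GPL-3.0",    "deps": []},
}

def transitive_licenses(root):
    """Walk the nested ('Russian doll') dependency graph breadth-first
    and collect every license encountered, visiting each package once."""
    seen, licenses = set(), set()
    queue = deque([root])
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue
        seen.add(pkg)
        licenses.add(PACKAGES[pkg]["license"])
        queue.extend(PACKAGES[pkg]["deps"])
    return licenses

print(sorted(transitive_licenses("my-app")))  # ['Apache-2.0', 'GPL-3.0', 'MIT']
```

The same traversal answers version questions (“does anything in my tree pin a vulnerable release?”) by swapping what gets collected at each node.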

Once GitHub has created the semantic graph of a repo so developers who already work on that code can search and navigate it, that also makes it easy for a new developer to get to grips with the code, whether they’re starting to work with it or just looking at the repo to help them learn more about coding.

“A large part of developers becoming productive is learning how code works, finding it and navigating through it. Our goal is to make that kind of navigation seamless as part of the GitHub experience,” Shanku Niyogi, who runs the product team at GitHub, told The New Stack. “By scanning and analyzing the code, and providing code navigation and code search, the ability to explore and find code and learn from that code becomes more natural.”

Instead of cloning the repo and building it to see how a method is used, developers can just see the code definition and find all the references in the browser, which helps them understand how the code works much more quickly.

One of the first places where GitHub has been taking advantage of the code graphs it’s building is finding and fixing security vulnerabilities, with Dependabot and with CodeQL, which transforms code into a database that developers can query, explained Jamie Cool, vice president of security for GitHub products.

Security researchers will use CodeQL to find new vulnerabilities. “You could write a simple query to return all the statements in a million-line codebase that are empty; it would be three lines in CodeQL but if you were to try to answer that question with a different tool you couldn’t,” Cool said. Instead of finding all integer overflows, you can find just the ones that have a source of network data that could be crafted to trigger the overflows. To find cross-site scripting vulnerabilities, you need to see where data is coming from and how it can be used, like inserting it into the DOM, which means tracking data as it moves through a program. CodeQL can find those lines of code, because it understands what the language constructs in the code actually do.
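Real CodeQL queries are written in its own QL language against a database extracted from the code. As a rough analogy for the “find all empty statements” idea Cool describes, here is the same kind of structural query over Python’s `ast` (the sample source and the `find_empty_bodies` helper are assumptions for illustration, not CodeQL itself):

```python
import ast

SOURCE = """
def handler(event):
    pass

class Stub:
    pass

def real_work(x):
    return x * 2
"""

def find_empty_bodies(source):
    """Flag functions and classes whose body is a single bare `pass`,
    a rough analogue of querying a codebase for empty statements."""
    tree = ast.parse(source)
    empties = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                empties.append((node.name, node.lineno))
    return empties

print(find_empty_bodies(SOURCE))  # [('handler', 2), ('Stub', 5)]
```

The power of the database approach is that richer questions — “which overflows are reachable from network input?” — become joins over the same extracted facts rather than new ad hoc scanners.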

Developers may not write their own queries but they can use the built-in queries to check that they’re not adding a known type of vulnerability to their code, and organizations can customize those queries. That’s available in a new Visual Studio Code plugin: “that’s an attempt to make the tool more accessible to more developers because so many developers are on VS Code today,” Cool explained.

Learning from the Graph

“Before, code was something of a black box,” he points out. “We could parse it and tell you what the file types in the repo were, but now we’re building up a graph of the code, we’re able to understand it in a way we weren’t before. Now we have that code graph, there’s all kinds of interesting applications. The area we are very much focused on right now is the security applications, because we think that’s where there’s the most critical need to apply the technology — but once we have a graph of the code, there’s lots of different things that we can do.”

Or as Stirman puts it more ambitiously, “If you have the world’s code and you have the world’s developers, you’re sitting on an enormously valuable set of data and how do you use that data for good.” The new code review assignment options start to build on this, he suggests.

Today the options for code review assignment are round robin (which allocates to each developer on the project in turn) or the new load balancing (which tries to avoid overloading any one developer with work). “Increasingly a standard practice is when I’m checking in code to a project I need a minimum number of people to review it before it goes into master,” Stirman explained. “But who gets assigned the task of reviewing code? Well, in a lot of project teams, it’s whoever raises their hand, or a team leader assigns it to people that they think can get it done. But it’s probably not the best distribution of reviewers.”
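The two assignment strategies GitHub offers are straightforward to sketch. This toy version (with hypothetical reviewer names and open-review counts) shows the difference:

```python
from itertools import cycle

REVIEWERS = ["alice", "bob", "carol"]

# Round robin: hand each new pull request to the next reviewer in a fixed rotation.
_rotation = cycle(REVIEWERS)
def assign_round_robin():
    return next(_rotation)

# Load balancing: hand the pull request to whoever has the fewest open reviews.
open_reviews = {"alice": 4, "bob": 1, "carol": 2}
def assign_load_balanced():
    reviewer = min(open_reviews, key=open_reviews.get)
    open_reviews[reviewer] += 1
    return reviewer

print([assign_round_robin() for _ in range(4)])   # ['alice', 'bob', 'carol', 'alice']
print([assign_load_balanced() for _ in range(3)]) # ['bob', 'bob', 'carol']
```

The expertise-based assignment Stirman envisions would replace the `min` over open counts with a score derived from the code graph — who has reviewed or written similar code before.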

Ideally, though, it would make sense to send code reviews to someone who has the right expertise or experience to look at the pull request — who has reviewed similar code in other projects, or commented on similar issues, or even based on whose code approvals tend to result in successful builds rather than further issues.

Similarly, if someone is working on an issue or a pull request in an area where they don’t have that experience, or their code has had problems in the past, GitHub could help them out by suggesting training. “We have Learning Lab,” Stirman noted; “Today that’s a platform for you to create ways for people to learn. But in the future, when you find someone on the team who is working in an area they’ve not worked in before, like secure development where a lot of developers don’t have the expertise. Couldn’t we give you a learning course that helps you get better at that particular thing that you’re working on? It could be that GitHub recommends things for people to learn, based on all kinds of criteria.”

The improvements in notifications that let you filter to see where you’re ‘@’ mentioned, rather than just part of a team that’s mentioned as a whole, help developers pick out the important notifications to deal with (like which ones are urgent enough to handle from the new mobile app, where you only see issues and pull requests that you created, commented on or were assigned to).

The new “exact match” search understands special characters and the semantics of code to find the useful matches. Image: GitHub.

That’s about making a fire hose of information manageable, Cool says. “A huge part of code scanning is making sure that the alerts are the right ones – because if you’re a maintainer on a thousand repos, I’m going to send you a thousand emails there. I can do a better job of controlling how you get that information, which is what we’ve been doing with this ‘garden hose’ feature. But the alert, when you get it, has to be one that makes sense for you to go fix and not just be noise.”

To avoid supply chain attacks, where a legitimate package is taken over by a malicious maintainer, GitHub is looking at an increasing range of reputation factors, like treating a repo differently when a new maintainer gets involved. Stirman calls that behavioral fingerprinting, and again the code graph could help with that. “Do their behavioral patterns indicate, potentially, someone behaving in your project in a way that is suspicious, so you need to reject the pull request they’ve submitted?”

Developers considering using an open source component often look at how many stars it has and how widely it’s used, but other metrics could be more useful. “I want to know what I need to know about this component repo before I use it. Is it active? Is it solving the problems [that get reported], and security is in that same category. And, you know, the set of information and data that we’re getting is getting broader and broader as we’re doing more and more. There’s just so much we can do to make everyone safer and more productive.”

Analyzing repos using the code graph could reveal best practices for coding and refactoring, and help with documentation by nudging the developer best placed to write comments. “Could we tell you in your code ‘this is semantically equivalent to this thing and we’re going to automatically rewrite the code to follow this path’? We could stub the comments, we could flag someone and say ‘you seem like the right person to comment on this code,’” Stirman speculated.

By understanding the global monorepo of so much open source, GitHub might be able to use the code graph to help with estimating how long development is likely to take, or looking at a pull request to see whether it actually changes the lines of code responsible for an issue.

Developers with private repos could benefit without having to allow their code to be mined for the code graph, Stirman said (code navigation already works for private repos). “There are potentially things for a customer in their private repo that might help them in their other repos, and there’s enough data in enough public repos that there’s a lot for everyone to learn from.”

GitHub is clear that whatever features build on the code graph, it needs to use data without breaking the trust of customers. “Yes, we have the tools and skills to take advantage of that data, but we also have an enormous responsibility, to protect the data of our customers, and to use the data responsibly,” Stirman said.

But he also pointed out, “Right now this is something you do manually. Anyone could go and look at a public repo but could we surface things that would be meaningful to everyone?” Don’t expect automated features that change code wholesale either. “This isn’t something where we want to bring a sledgehammer to the problem. I think it’s something where we make the information available and we let people make the right choices for them.”

Feature image by stokpic from Pixabay.
