Semmle’s Insights Signal a Gestalt Shift in Security

“What our industry has is a security embarrassment,” said Semmle co-founder and CEO Oege De Moor, during an interview at the recent GitHub Universe in San Francisco. “Security today is a knowledge-sharing problem.”
Security has traditionally been thought of as the sole responsibility of the security team. But with the growth of continuous integration/continuous deployment (CI/CD) pipelines and the growing use of microservices, even 100-year-old companies like Fitch Ratings are coming to the realization that security is everybody’s problem, developers included, all along the pipeline.
Having so many moving parts has given rise to security problems on a massive scale. Combine that with the explosion in the sheer amount of data being passed around, and even small security breaches can take a company down. Equifax lost an estimated $60-$75 million in its data breach last year, and that does not include the financial impact on its individual customers. Yahoo’s value dropped $350 million after a botched data breach impacted 3 billion customers. And they’re not alone.
The problem across the industry is that very few people are trained in security protocols, including basics like what to look for and how to find security breaches. Add that to sloppy practices such as not applying patches (cough Equifax cough) and your company has a disaster waiting to happen.
Enter Semmle, a company at the bleeding edge of database and programming-language research. Semmle’s data-driven software engineering platform was not possible 10 years ago and just came out of beta testing with some very heavy hitters. Microsoft, Google, NASA, Nasdaq, Dell, and Mozilla are all early adopters, giving Semmle access to huge quantities of data to play with.
Semmle’s QL product is an analytics engine that not only pulls data from different databases together, but also creates structured data out of code. It not only translates code into data but also includes the data libraries for each of the languages in its query engine, allowing the associated LGTM product to do some deep-level querying. Named after the popular developer sign-off “Looks Good To Me,” the software engineering analytics platform analyzes every repo commit and reviews it using a number of different criteria.
By turning code into structured data, the analytics engine can then include the code in searches for security breaches. Another advantage of turning code into searchable data is that when a breach is uncovered, all of your repositories can be searched for similar code, which is then evaluated, catching possible breaches in similar, but not identical, code. This is made possible by adding the code language libraries so the engine recognizes syntax anomalies.
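To make the idea of code as searchable data concrete, here is a minimal sketch in Python. It is not Semmle’s QL engine, whose internals are not described here; it simply uses Python’s standard ast and sqlite3 modules to turn function calls into table rows and then searches several repositories at once. The repository names, file contents, and the choice of pickle.loads as the risky call are all assumptions made for illustration.

```python
import ast
import sqlite3

# Hypothetical in-memory "repositories": repo name -> {file path: source}.
REPOS = {
    "billing": {"load.py": "import pickle\n\ndef read(blob):\n    return pickle.loads(blob)\n"},
    "reports": {"cache.py": "import pickle as p\n\ndef get(raw):\n    data = p.loads(raw)\n    return data\n"},
    "website": {"views.py": "import json\n\ndef parse(text):\n    return json.loads(text)\n"},
}


def call_facts(repo, path, source):
    """Yield one (repo, path, line, module, function) row per module.function() call."""
    tree = ast.parse(source)
    # Map local aliases back to real module names, e.g. "p" -> "pickle".
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                aliases[alias.asname or alias.name] = alias.name
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            module = aliases.get(node.func.value.id, node.func.value.id)
            yield (repo, path, node.lineno, module, node.func.attr)


def build_database(repos):
    """Load the call facts for every file in every repo into one SQL table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE calls (repo TEXT, path TEXT, line INTEGER, module TEXT, function TEXT)"
    )
    for repo, files in repos.items():
        for path, source in files.items():
            db.executemany(
                "INSERT INTO calls VALUES (?, ?, ?, ?, ?)",
                call_facts(repo, path, source),
            )
    return db


def find_calls(db, module, function):
    """Return (repo, path, line) for every recorded call to module.function."""
    cursor = db.execute(
        "SELECT repo, path, line FROM calls "
        "WHERE module = ? AND function = ? ORDER BY repo, path, line",
        (module, function),
    )
    return cursor.fetchall()


db = build_database(REPOS)
hits = find_calls(db, "pickle", "loads")
for repo, path, line in hits:
    print(f"{repo}/{path}:{line}")
```

Because the rows record the resolved module name rather than the raw text, the search finds the call in the second repository even though that file imports pickle under an alias. That is a small-scale version of catching similar, but not identical, code.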
The platform doesn’t just provide insights on which repositories are vulnerable; it delivers action items to fix the gaps and bring the repo up to snuff. “All of our insights have to be actionable,” said De Moor. This approach of viewing code as queryable data means that security research results can be codified as a query, and fixes suggested.
Creating Gestalt Change
People are realizing how important security is, but it’s still a dark art. Developers and others outside security teams don’t understand how it works. There’s some sort of vague idea of what needs to happen, but not anything specific.
“QL changes the entire approach,” said Galen Menzel, Semmle communications manager, in a happy hour interview. “So now it’s not just ‘this package has a security flaw’ or ‘here’s a widespread security flaw and you should check all your code to see if it might be in there somewhere.’ It’s codified as a query that’s run against your code, so now you get ‘in this file, on this line, you have this problem — you should fix that.’”
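A rough sketch of what “codified as a query” can look like, again in plain Python rather than QL: a piece of security knowledge is written once as a function that walks the parsed code and reports a file, a line, and a message. The check itself (flagging calls that pass shell=True), the Alert type, and the sample file are hypothetical and chosen only for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One finding: where the problem is and what it is."""
    path: str
    line: int
    message: str


def query_shell_true(path, source):
    """Hypothetical check: flag any call that passes the keyword shell=True."""
    alerts = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for keyword in node.keywords:
            if (keyword.arg == "shell"
                    and isinstance(keyword.value, ast.Constant)
                    and keyword.value.value is True):
                alerts.append(Alert(
                    path,
                    node.lineno,
                    "call passes shell=True; risk of command injection "
                    "if the command includes user input",
                ))
    return alerts


# Hypothetical file contents used only to exercise the check.
SAMPLE = (
    "import subprocess\n"
    "\n"
    "def archive(name):\n"
    "    subprocess.run('tar czf out.tgz ' + name, shell=True)\n"
    "\n"
    "def listing():\n"
    "    subprocess.run(['ls', '-l'])\n"
)

alerts = query_shell_true("tools/archive.py", SAMPLE)
for alert in alerts:
    print(f"{alert.path}:{alert.line}: {alert.message}")
```

The second call in the sample is left alone because it never sets shell=True, so the developer sees a single alert that points at one specific line instead of a general warning to go audit everything.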
This extraordinary depth of data mining also allows the platform to present code fixes in the order in which the engineer is most likely to be interested. By analyzing the behavior of developers who contribute to the product, they can be like Netflix, said De Moor. “‘If you liked this, you’ll like that.’ The idea is that engineers are going to work on the code fixes they like to work on first, so make it easy for them.”
At its core, Semmle is about empowering engineers, said Menzel. “Not only does it alert engineers to issues in their repos, but it also empowers them to understand what those problems are.”
There’s a huge gap between security research and the adoption of security practices by developers (who still see security as a black art and not understandable), he said. Semmle bridges that gap. Now if there’s a security problem, it manifests as an alert that’s flagged as something the developer can just go fix.
But the best thing, he said, is that because of the intimacy between their code and the security finding, developers can not only fix the bug but also begin to understand what those sorts of security problems look like, and will be less likely to write that bug in the future.
The Stars Are Out
With all this data, Semmle evaluated the star ratings on repos across GitHub along with the code of those repos, and found that the repos with the least buggy code have the highest star ratings. The value here is that if your star rating is below standard, LGTM can evaluate the code in the less-starred repo and provide a list of actionable items to fix to bring the star rating up.
In this community-based security model, when clients write or refine a query that looks through their code to find security breaches, and it turns out to have general application, it’s added to all the LGTM projects, allowing smaller companies to leverage the knowledge of the big guys. Code fixes then spread across the platform, reaching not just Semmle’s customers but all open source projects as well.
“It’s pretty exciting to think about this self-reinforcing virtuous cycle where the queries make the software better and software projects serve as a massive testing ground for the queries and make the queries better,” said De Moor.
Other Applications
The security application is just one advantage of Semmle’s query-as-code model. While the company is focusing on security as the key gap needing to be filled right now, clients are finding other useful ways to mine their code. One client took the code of 100 contractors and searched for known bugs. Because of LGTM’s specificity, they could tie faulty code to individual contractors. They found two of the contractors were writing code; the rest were writing bugs. By firing the bug writers, the company was able to save almost a million dollars annually, not including the cost of fixing the bugs and lost productivity.
While security is the most critical application right now, the possibilities are exciting, said De Moor. “We’re looking forward to seeing what else our clients can do with this platform.”
Feature image: Oege De Moor at GitHub Universe. Photo by TC Currie.