
Microsoft’s Open Source Scalar Brings Large Repository Support for Git

24 Feb 2020 11:10am

For companies like Microsoft, scale pertains to every part of the software delivery lifecycle, not just handling traffic spikes from user bases that span the globe. That means even the code repository has to handle the needs of more than a hundred thousand developers and repositories that run to millions of files. Last week, Microsoft made some strides toward that end with the release of the open source project Scalar. Scalar is a .NET Core application for Windows and Mac (Linux pending) that supports very large code repositories, maximizing git performance by setting recommended config values and running background maintenance.

According to Derrick Stolee, a principal software engineer with Microsoft Azure DevOps who is part of the team that created Scalar, git is “one of the most used and important” DevOps tools at Microsoft, with the distributed code repository used to build products like Microsoft Office, Azure, and Windows. The Windows repository, he says, is believed to be the largest git repo in existence.

The work actually started with an earlier project, the Virtual File System (VFS) for Git, which uses a virtualized filesystem to make git think the files are present, when they are not, and manages git’s internal state to only consider files accessed, ensuring quick execution of commands like status and checkout. Microsoft had been using VFS for Git (formerly GVFS) to support the Windows repository, but along the way identified performance bottlenecks. Subsequently, the team responsible for creating VFS for Git began working on a way to eliminate the need for the virtual filesystem and created Scalar.

“It’s a continuation in the spirit, as it aims at solving similar issues, but using different approaches,” explained Stolee in an email interview with The New Stack. “VFS for Git uses a virtualized filesystem, while Scalar leverages some improvements in the git core and other solutions to bypass the need of a virtualized filesystem. The core idea of Scalar came about by examining how VFS for Git works and considering what would happen if we removed the virtualized filesystem. By swapping the git sparse-checkout feature for the virtualized filesystem, we were able to get a prototype of what became Scalar rather quickly.”

Stolee takes care to offer a detailed explanation of the “three important lessons that informed Scalar’s design” in a blog post, listing them as focusing on the files that matter, reducing object transfer, and not waiting for expensive operations. As such, Scalar does several things similarly to VFS for Git but, again, without the virtualized filesystem. Stolee said that “some significant shifts” in Scalar’s architecture compared to VFS for Git allowed them to move the project entirely onto .NET Core 3.0, which, in combination with abandoning the virtualized filesystem, helps Scalar support more platforms.

For git users with a large git repository, “scalar register” tells Scalar to configure certain settings and hooks to optimize performance, as well as run regular maintenance in the background to keep things moving quickly when needed. Stolee says these capabilities rely on regular git features that already exist in git v2.25.0 but require expert knowledge to set up correctly. At the same time, he says that Scalar provides some additional functionality not present in git.
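As a rough illustration of the kind of setup work that “scalar register” takes off a developer’s hands, the following Python sketch builds the plain git commands that would apply a handful of performance-related settings to an existing repository. The keys shown are real git config options, but the selection and values are assumptions made for illustration and are not Scalar’s actual list; the helper names and the repository path are hypothetical.

```python
import subprocess
from typing import Dict, List

# Illustrative settings only: these are real git config keys,
# but this is NOT the list of values that Scalar itself applies.
EXAMPLE_SETTINGS: Dict[str, str] = {
    "core.preloadIndex": "true",    # preload the index in parallel
    "core.commitGraph": "true",     # read the commit-graph file when present
    "gc.writeCommitGraph": "true",  # refresh the commit-graph during gc
    "index.version": "4",           # index format with pathname compression
}


def build_config_commands(repo_path: str, settings: Dict[str, str]) -> List[List[str]]:
    """Return one `git config` invocation per setting, without running anything."""
    return [
        ["git", "-C", repo_path, "config", key, value]
        for key, value in settings.items()
    ]


def apply_settings(repo_path: str, settings: Dict[str, str]) -> None:
    """Run the generated commands; raises CalledProcessError if git fails."""
    for command in build_config_commands(repo_path, settings):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical path; print the commands rather than executing them.
    for command in build_config_commands("/path/to/large-repo", EXAMPLE_SETTINGS):
        print(" ".join(command))
```

Keeping command construction separate from execution makes it possible to review exactly what would be changed before touching a repository, which matters when the settings are a guess rather than a vetted list.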

“For supporting large engineering teams within Microsoft, we require some features not available in git. Most of these revolve around reducing data transfer, which Azure Repos (part of Azure DevOps) solves with the GVFS protocol. The ‘scalar clone’ command creates a new local repository and uses the GVFS protocol to download a smaller set of objects than the full git history, including every version of every file in history; these versions can be downloaded on-demand. It also initializes the repository with a sparse-checkout definition so not all files are populated at the start. The combination of sparse-checkout and on-demand downloads allow users to get started with extremely large repositories,” wrote Stolee.
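The general shape of that workflow, fetching fewer objects up front and populating only part of the working directory, can be approximated with stock git. The Python sketch below is not what “scalar clone” does internally: it swaps the GVFS protocol for git’s own partial clone filter, which requires a server that supports it, and pairs that with the sparse-checkout command available since git v2.25.0. The URL, target directory, and folder names are hypothetical.

```python
import subprocess
from typing import List


def build_sparse_partial_clone(url: str, target: str, directories: List[str]) -> List[List[str]]:
    """Return the git commands for a blob-less, sparse clone (nothing is executed)."""
    return [
        # Partial clone: skip file contents (blobs) until they are needed,
        # and start with only the files at the root of the repository.
        ["git", "clone", "--filter=blob:none", "--sparse", url, target],
        # Switch the sparse-checkout definition to directory-based cone mode.
        ["git", "-C", target, "sparse-checkout", "init", "--cone"],
        # Populate only the directories the developer actually works in.
        ["git", "-C", target, "sparse-checkout", "set", *directories],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Execute the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical repository and folders; printed here rather than executed.
    plan = build_sparse_partial_clone(
        "https://example.com/big-repo.git", "big-repo", ["src/client", "docs"]
    )
    for command in plan:
        print(" ".join(command))
```

Missing blobs are then fetched from the server on demand as commands such as checkout need them, which is the same trade of up-front transfer for later round trips that the quote describes.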

Jordi Mon, a senior product marketing manager at GitLab, seemed cautiously optimistic in his evaluation of Scalar’s release, recalling the history of another attempt to solve the problems presented by large repositories with Git Large File Storage (LFS). Developed originally by people from Atlassian and GitHub, among others, the solution ultimately was found to be not easy to integrate, requiring its own commands, and capable of considerably slowing down performance, he said.

“[T]his is a bold move by Microsoft. In essence, what they are saying is we are not waiting for native large file support in git, we have done it ourselves. It’s difficult to say how successful of a move this will be until user feedback on performance and UX is publicly available,” wrote Mon in an email. “.NET Core seems to be a leap forward from .NET framework in many ways, and I am sure Scalar will only prove that to be true.”

Indeed, in the blog post, Stolee makes clear that the team is “intentionally ... making Scalar do less and investing in making git do more,” and he expanded on this point by email.

“While we have contributed many features to git that make it able to handle larger repositories than it did before, there are still some major hurdles to clear before all of our engineers can rely entirely on git,” wrote Stolee. “Scalar is our way of bridging that gap until we can contribute enough features to git that Scalar becomes unnecessary. We expect this effort to take a few years, but we need to support the needs of large engineering teams inside Microsoft today.”

Scalar, rather, demonstrates the value of features like background maintenance, says Stolee, and any implementation in git is likely to look very different. He likens this to the extra HTTP API for fetching objects dynamically that was used in VFS for Git, which eventually inspired the partial clone feature built by Microsoft’s Jeff Hostetler and Google’s Jonathan Tan, which was then improved upon by the community. Stolee says he hopes for this same sort of community support in bringing Scalar’s features to git.

“Together, we usually discover an improvement on our initial approach and we look forward to that open collaboration on the features that would allow us to move off of Scalar and onto git itself,” wrote Stolee.

Because the work is happening in upstream git, of course, Scalar’s features may end up influencing downstream projects, such as GitHub and GitLab. Stolee said the team is currently sending patches upstream to git so that support for Scalar, though not Scalar itself, can be included in official builds. The company does not, however, have any announcements yet about how VFS for Git or Scalar will be supported on GitHub. Jordi Mon explained that Scalar’s eventual integration into GitLab would come as a result of its standard process.

“I can’t speak for the Product team but if it’s open source and the community requests it, it could likely end up in GitLab’s roadmap. Unlike Microsoft or GitHub, who deliver closed source software and don’t open their roadmaps, most of our features are community-driven and eventually included in our publicly available roadmap,” wrote Mon.

GitLab is a sponsor of The New Stack.

Feature image via Pixabay.
