How Microsoft Forged a Scalable Git to Better Manage Windows Development
Frustrated by the limitations of BitKeeper, Linus Torvalds created the Git distributed source control software from scratch over a weekend a decade ago, as a better way to manage the ongoing development of the Linux kernel, which has seen thousands of developers add to its codebase over the years.
Feeling the limitations of its own version control efforts, Microsoft adopted Git to manage its own considerable portfolio of software products. But while Git promised to ease multiple-developer work on Windows, the Redmond giant also found the software had difficulties scaling. With over 300GB of source code spread across 5 million files, Windows was too large for Git to handle. Getting a simple Git status would take 18 minutes; a Git commit could take half an hour.
So the company developed the Git Virtual File System (GVFS) to serve as a virtualization layer that would speed operations. At this year’s Microsoft Connect() conference, the company announced a partnership with GitHub, the largest Git-based hosting service, to develop GVFS.
At the conference, we spoke with Microsoft Corporate Vice President Brian Harry, who oversees the company’s release management service, Visual Studio Team Services (VSTS), available on Azure (and offered as Team Foundation Server (TFS) for those wishing to deploy such capabilities on-prem). We wanted to learn more about GVFS, as well as the latest releases of VSTS and TFS and the ongoing adoption of the DevOps methodology of application deployment and management.
With over 1,000 devs working on 832 repositories, Microsoft has become "one of the most prolific open source contributors, out of all the companies we see using #GitHub" — @isamuel #MSFTConnect2017 pic.twitter.com/gcOxOJib93
— Joab Jackson (@Joab_Jackson) November 15, 2017
What are you seeing in Microsoft shops, in terms of them adopting DevOps?
I think we see the same thing in the Microsoft community as we see in all of the developer communities: a lot of focus on being more agile and more responsive. Back in the early 2000s, the focus was on the beginning of the agile movement, with continuous integration and unit testing and practices like that. And then it evolved to Scrum and then Kanban and the project management aspects. And now the focus is really on automating the deployment processes. How do I remove the friction from getting all the code into production, so I can update more frequently, patch more frequently, respond to feedback more frequently? Every customer I talk to, that is what they want to talk about.
Is this being driven by competitive needs, or is it just a natural evolution of IT?
I think some of both. Some of it is driven by customer expectation. Customers just expect stuff to be fresh, to be new, to be constantly up-to-date. And some of it is driven by competition. If I’m on a six-month deployment cycle, and the other guy is on a two-week deployment cycle, I’m going to lose, because he is going to respond a lot more quickly.
Look, you’re always going to make mistakes. You are always going to miss things, get something not-right. The more iterations you have, the more opportunities you have to get it right.
In the old days, there was this very arm’s length relationship between IT and development, where the developers would develop something and they would hand it off to IT. It would take three months to provision the hardware needed to deploy it, then they’d have to go through a bunch of testing and validation of the deployment. At some point, it would get deployed. Meanwhile, the developers are months into doing something else and don’t even remember what they put in that release.
That sort of discontinuity created a massive amount of friction. So one of the big transitions is eliminating handoffs. And that can be a scary thing. It does mean your responsibilities change. I mean if I look at my organization: Years ago, it worked exactly just like I described. Now, ops has nothing to do with provisioning infrastructure because it’s all in the cloud, and the cloud just does it. Ops has nothing to do with deploying anymore. It’s all automated. It’s just scripts that come out of the engineering team that does the deployment. Does that mean ops has no job? No. Ops still does tons of stuff. For us, they do capacity planning, they still are deeply involved with incident management, security. Securing these systems is critical in today’s threat world, and way more involved than it was in the past.
So you just rejigger responsibilities so you change it so there is no handoff.
So the role of the system administrator has changed a great deal. OK, can you tell me what Visual Studio Team Services and Team Foundation Server bring to the development environment?
Sure. So first, Team Foundation Server and Visual Studio Team Services, it’s weird that they have two different names, but they’re basically the same thing. Team Foundation Server is on premises and Visual Studio Team Services is in the cloud. They both offer a rich set of DevOps services, from planning that allows you to do semaphores and sprints and all of that stuff through source code control where you host Git repos. We host a centralized version control system called Team Foundation Version Control. We have build capabilities with continuous integration and build pipelines. We have release capabilities for managing sort of staged releases across environments and managing deployments. We have testing capabilities. We have package management for managing all of your binary assets.
These are tools that speak to virtually every aspect of your DevOps lifecycle. And at the same time, they’re composable. So years ago, one of the dings against TFS was that it is a monolith, you have to use all of it. And really over the last few years, that’s changed a lot. You don’t have to. If you want to have your code in GitHub and you want to use our CI/CD system, great. It works great with GitHub. If you want to use GitHub for source code, Jenkins for your build and then you want to use our continuous deployment system for managing deployment, great. Our CD system works great with Jenkins.
We also support traceability across all of those things, so we integrate deeply with them. Really, we focus more on making the system a set of composable parts that allow you to use it with your favorite tools.
Microsoft had a number of updates to this service announced today at the conference. One is release gates. Can you explain what they are?
Right. So let’s start with the problem. So in any DevOps process when you’re releasing to a large base of customers, you have to be very careful. First, you have to recognize that you will ship bugs. I don’t care what testing process you use, you will ship bugs. And so the question then becomes, “What are you going to do about the bugs that you ship?” Are you going to deploy out this new code to all one million users and boom, have 300,000 of them hit this bug and then you’ve got 300,000 unhappy customers? Or are you going to deploy it to 1,000 customers and have 1,000 hit the bug and then fix it and then deploy it to the next thousand and sort of gradually roll it out so that you have sort of a controlled way of managing the risk that you take?
So we formalize that notion of this gradual rollout process into what we internally call rings. We have Ring 0, Ring 1, Ring 2, Ring 3, and Ring 4. Ring 0 is just Microsoft. So when we deploy to Ring 0, we don’t affect anybody but us. And then once we’re happy there, then we roll to Ring 1 and go on from there.
Between each ring, we wait 24 hours. We deploy to a ring, we wait 24 hours, we watch the health monitors, we look for any sign that something might be wrong. And if something is, we stop and fix it and then continue rolling out. It proceeds across all five rings this way.
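As a rough illustration of the ring process described above, here is a minimal Python sketch. It is not Microsoft's tooling: the `roll_out` function, the `RolloutResult` type, and the `deploy`, `wait`, and `is_healthy` callbacks are all hypothetical names introduced for this example. The `wait` callback stands in for the 24-hour observation window.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    completed: List[str]        # rings deployed and judged healthy
    halted_at: Optional[str]    # ring where the rollout stopped, or None


def roll_out(
    rings: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wait: Callable[[str], None],
) -> RolloutResult:
    """Deploy ring by ring, stopping at the first unhealthy ring."""
    completed: List[str] = []
    for ring in rings:
        deploy(ring)
        wait(ring)  # observation window (24 hours in the interview)
        if not is_healthy(ring):
            # Stop here so later rings never receive the bad build.
            return RolloutResult(completed, ring)
        completed.append(ring)
    return RolloutResult(completed, None)
```

In this sketch a halted rollout simply returns; fixing the bug and resuming from the halted ring would be a separate call.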
Release gates are a formalization of that process. In our release management system, we already have the ability to define these kinds of environments that things flow across. What release gates allow you to do is automate the readiness checks. So I can define release gates for an environment that measure some KPI, that look at something and say, “This release, this environment is not considered healthy unless the following is true.”
So you can configure the process to wait six hours in this environment. And in those six hours, you can set five metrics to monitor. And if at the end of those six hours all five metrics are green, then it will roll on to the next release.
We introduced two sorts of release gates. One is an Azure monitoring release gate where you can hook it up to watch any Azure alert. And you set a threshold that says, “More than this is considered unhealthy.”
And then the second thing you can do is have it monitor a work item query. And we use that. So let’s say a customer reports a problem in our 24-hour window. We get a call or we get an email or we get a bug filed on our developer community site. It will get classified as a release blocking bug if it’s a serious bug. So our release gate would watch that query and if at the end of our 24 hours there’s any release blocking bug still active, it will hold the release.
So that’s sort of how you think about it. Now we also have three other release gates which are extensibility points. You could call out to an Azure Function. You can call out to any REST API anywhere or you can post a message on Service Bus and those then allow you to create whatever release gate you want. An example that I’ve used is I’m having my team write a sample which we’re going to open source and let everybody see how to do it that would monitor Twitter sentiment because it’s one of the things we watch. Today we watch it manually. We’re watching Twitter and we’ll block the release if we see Twitter light up with people having problems. But it’d be nice to have an indicator on the release that says, “Oh, since you released this, since you deployed this, your Twitter sentiment has dropped significantly, let’s go look at that and figure out if that actually represents a problem.”
So that’s fundamentally what release gates are about. It’s automating sort of readiness checks on this gradual roll-out process.
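To make the gate idea concrete, here is a small Python sketch of the two gate types described above: a threshold on a monitored value, and a query that must come back empty. This is an illustration only, not the VSTS API; `Gate`, `threshold_gate`, `query_gate`, and `evaluate_gates` are hypothetical names, and the metric readers and query runner are stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Gate:
    name: str
    check: Callable[[], bool]  # returns True when the gate is green


def threshold_gate(name: str, read_metric: Callable[[], float],
                   max_allowed: float) -> Gate:
    """Green while the metric stays at or below the threshold."""
    return Gate(name, lambda: read_metric() <= max_allowed)


def query_gate(name: str, run_query: Callable[[], Sequence[str]]) -> Gate:
    """Green only when the query returns no active items."""
    return Gate(name, lambda: len(run_query()) == 0)


def evaluate_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Return (all_green, names of failing gates)."""
    failing = [g.name for g in gates if not g.check()]
    return (not failing, failing)


if __name__ == "__main__":
    gates = [
        threshold_gate("alert count", lambda: 2, max_allowed=5),
        query_gate("release-blocking bugs", lambda: ["bug-123"]),
    ]
    all_green, failing = evaluate_gates(gates)
    print("promote" if all_green else f"hold: {failing}")
```

A custom gate, such as the Twitter sentiment check mentioned in the interview, would fit the same shape: any callable that returns True or False can be wrapped in a `Gate`.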
What is the Git Virtual File System (GVFS)? Why did it come about? How could it be used by GitHub?
About three years ago, we were looking at our engineering systems. As part of that, we were looking at version control. We had a bunch of different version control systems internally. And the decision that we made after much debate was that we wanted to move the entire company to Git. That’s where we wanted to go. But at the same time, looking at the realities of Git, it was going to be impossible. We have some code bases that are just so large. The canonical one is the Windows code base, which is now in Git and it’s a 300GB repo, which in Git is unthinkable. It is so big. You’ve got to remember, Git clones the whole repo down to your machine and who wants 300GB coming down to your machine?
So we knew that to execute the strategic decision to move to Git, we really had to join the Git community and start helping evolve Git in the direction to be scalable. So we started that, and GVFS came out of that effort. Basically, it’s a combination of things. Part of it is just performance tuning that makes Git better for everybody.
Specifically, GVFS is this virtualization layer that we added to Git that enables you, for example, to clone a Git repo so you don’t get everything. You only get sort of the metadata necessary to do basic operations. As you touch files, it will incrementally go back and download the pieces that you need. Windows is 5.5 million files. If I only ever work on 10,000 of them, then I only get those 10,000 files.
The other dimension is depth. If I only ever use the current version of those 10,000 files, I only get the current versions of those 10,000 files. If I need to go back and access a version from six months ago, then when I need to do it I’ll go back to the Git server, pull down the history that I need and now it will cache locally, so subsequent accesses are really fast, but I’ll only get the pieces that I need when I need them. Performance wise, it’s just dramatic. The differences are night and day.
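The download-on-first-touch behavior described above can be sketched as a toy, in-memory Python class. This is not how GVFS is implemented (it is described here as a file system virtualization layer); `LazyWorkingTree` and its `fetch` callback are hypothetical names used only to show the idea: the clone carries metadata, content arrives on first access, and each (file, version) pair is cached locally afterward.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class LazyWorkingTree:
    """Knows every path up front but downloads content only on demand."""

    def __init__(self, paths: Iterable[str],
                 fetch: Callable[[str, str], bytes]) -> None:
        self._paths = set(paths)   # metadata only: which files exist
        self._fetch = fetch        # stand-in for a round trip to the server
        self._cache: Dict[Tuple[str, str], bytes] = {}
        self.downloads = 0         # number of server round trips so far

    def list_files(self) -> List[str]:
        # Listing needs no content, so it triggers no downloads.
        return sorted(self._paths)

    def read(self, path: str, version: str = "HEAD") -> bytes:
        if path not in self._paths:
            raise FileNotFoundError(path)
        key = (path, version)
        if key not in self._cache:
            # First touch of this file at this version: go to the server.
            self._cache[key] = self._fetch(path, version)
            self.downloads += 1
        return self._cache[key]   # later reads are served locally
```

Reading the same file twice costs one download; asking for an older version costs one more, which mirrors the breadth and depth dimensions described in the interview.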
I mean when we first started, if I tried to Git clone the Windows repo, it actually never succeeded. It just would time out. It wouldn’t work. Git status, which tells you what files you’ve changed, is normally an instantaneous operation. When we first started with the Windows repo, that was about an 18-minute operation to do the Git status. A Git commit was like 30 minutes.
With GVFS and all the performance tuning work that we’ve done, Git clone of the Windows repo takes about a minute and a half. Git status is about two and a half seconds. Git commit is about five or six seconds. Very reasonable numbers for sort of interactive behavior.
So we released that. GVFS is an open source project. Anybody can get access to it. We published the protocols, so that any Git service could implement it. The big announcement we made this week is partnering with GitHub to continue to drive this forward. GitHub is very clearly invested in Git and the future of Git. It wants to add GVFS support to GitHub and work with us to continue to advance Git and to help us accelerate the work to get Mac and Linux clients. We’ve done a Windows client. We are working on a Mac client. We haven’t started our Linux client yet, though it’s on the roadmap. We’re planning to do it. GitHub is going to partner with us and help us get this done a lot faster. So we’ll work together as peers in the open source project to drive it forward.
Any other takeaways from this year’s Connect that we should keep in mind?
We also announced Symbol Server support for Visual Studio Team Services. One of the problems developers face is they build an app and then they give it to somebody, and it crashes or it doesn’t work on your machine. It works on my machine. So I need to go to your machine to debug, but you don’t have the source.
So what Symbol Server gives you is the ability for me to go to your machine, attach a debugger to your running process, to inspect the process, look at the versions of the DLLs, go up to VSTS, find the symbols associated with it, download them, find the code associated with those symbols, download the files that you’re going to debug. I get a full rich source code debugging experience on your machine even though you never had any of the code on your machine. So that’s very cool. We now have Symbol Server built in as part of VSTS. We’ll also include it in TFS down the road. It’s not in TFS 2018 because it just hit the cloud and it will get into a future update of TFS.
Microsoft is a sponsor of The New Stack.