A decade ago, Microsoft’s publicly stated policy towards Linux was to compel its vendors, by any and all means, to obtain paid licenses for whatever Windows technologies they may, directly or indirectly, have stolen. Last week at DockerCon 2017 in Austin, two of Microsoft’s busiest engineers demonstrated how they are rewiring the schematics of Windows Server 2016 (due for a major update in just a few weeks’ time), enabling it to manage and run Linux-based containers on a Linux subsystem within Windows.
“We wanted to have one layer that was kind of a, ‘Here’s the entry point to all things,’” described Taylor Brown, Microsoft’s principal lead program manager.
Technically, Brown was explaining why Microsoft intentionally redesigned its Hyper-V virtualization system to include something called Host Compute Service (HCS). His team had observed how, in Linux systems, multiple permutations of container systems were simultaneously pinging the same control group (cgroup) interfaces.
“What we feared was, someday you’re going to have Docker running next to rkt next to some other thing,” he continued, “and there’s going to be no common way to be talking about these things at all.”
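The "one entry point" Brown describes resembles a facade: every container tool calls a single management layer instead of reaching for low-level primitives on its own. The following is a toy Python sketch of that shape only. The class `HostComputeFacade`, its methods, and the container names are hypothetical illustrations and are not the real HCS API.

```python
from dataclasses import dataclass, field


@dataclass
class HostComputeFacade:
    """Toy stand-in for a single management layer (hypothetical, not the real HCS API)."""
    containers: dict = field(default_factory=dict)

    def create_container(self, runtime: str, name: str) -> str:
        # Every runtime registers through the same call, so the host keeps
        # one consistent inventory regardless of which tool made the request.
        if name in self.containers:
            raise ValueError(f"container {name!r} already exists")
        self.containers[name] = runtime
        return name

    def list_containers(self) -> list:
        # One common way to describe containers, whatever created them.
        return sorted(self.containers.items())


hcs = HostComputeFacade()
hcs.create_container("docker", "web")
hcs.create_container("rkt", "db")
inventory = hcs.list_containers()
print(inventory)
```

The point of the sketch is the single inventory: two different runtimes end up described in one place, which is the outcome Brown said his team wanted to guarantee.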
The Entry Point to All Things
Soon after Microsoft premiered its Docker support two years ago at its own company conference, it presented our first glimpse at a scheme for Windows and Linux interoperability. In August 2015, Taylor Brown explained why Microsoft chose to produce two implementations of containers: one just branded “Windows containers,” and the other “Hyper-V containers.” For security purposes, he said, you may have a need to run containers in perfect isolation, and Hyper-V provides that.
Of course, isolation was the original idea behind the creation of cgroups in Linux anyway. So there was a lingering question over why a Windows dev or admin wouldn’t want the Hyper-V implementation in every case.
On Tuesday came a long, thorough, and somewhat necessarily circuitous response to that question.
“When you run a Windows Server container, that is a shared kernel,” explained Brown, being careful now to include the word “Server” in the phrase for reasons that will soon become obvious. “If I run a second one of those, same kernel. They also share the same kernel with the host.”
As it turns out, enabling this kernel sharing runs contrary to the architecture of Windows 10, Microsoft’s client-side OS.
“Even though it’s the same kernel between Windows 10 and [Windows] Server,” said Brown, in response to an attendee’s question, “they operate differently. You get different scheduling parameters, you get different memory management techniques. So if you were trying to get to a state where you could say definitively, ‘This is going to run the same way,’ those will interfere with and change the way things work.”
Put another way: If the Docker-inspired methodology of sharing kernels were to be applied to both Windows 10 and Windows Server, then most every effort you would make to try to balance the performance characteristics of a container across both environments would lead to imbalance. This is a problem Microsoft has encountered before, specifically with its long-standing web server, IIS. Beginning with version 6.0 (released in 2003), it uses a kernel-mode device driver HTTP.SYS — in Windows parlance, a library intended only for use at the base layer of Windows. This was Microsoft hard-wiring the core of its Web server into its operating system.
Although Microsoft unified its kernel for client and server OSes, their support structures have diverged greatly. Their concepts of time and process scheduling are now based on separate constructs. As a result, IIS behaves dramatically differently on the two OSes. Ensuring consistent behavior across all supported implementations is now a requirement for any containerization platform. So despite documentation published only months ago explaining how to run both container types on Windows 10, going forward, said Brown, the client OS will only run Hyper-V containers.
“That image would work differently in those different environments,” continued Brown, “which is kind of counter to the entire spirit and goal of what we’ve done with Docker. Which is why we’ve implemented it the way that we did, at least for now.”
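The policy Brown describes can be written down as a small lookup: the server OS permits both container types, while the client OS permits only Hyper-V containers. The sketch below is a hedged illustration in Python. The host labels, the `resolve_isolation` helper, and the choice to prefer the shared-kernel mode by default are all assumptions made for the example, not Microsoft's or Docker's implementation.

```python
# Isolation modes each host permits, per the session (labels are illustrative).
ALLOWED_ISOLATION = {
    "windows-server": {"process", "hyperv"},  # shared-kernel or Hyper-V containers
    "windows-10": {"hyperv"},                 # client OS: Hyper-V containers only
}


def resolve_isolation(host_os, requested=None):
    """Return the isolation mode to use, or raise if the host does not permit it."""
    allowed = ALLOWED_ISOLATION[host_os]
    if requested is None:
        # Assumption for this sketch: prefer the shared-kernel mode when permitted.
        return "process" if "process" in allowed else "hyperv"
    if requested not in allowed:
        raise ValueError(f"{requested!r} isolation is not available on {host_os}")
    return requested


server_default = resolve_isolation("windows-server")
client_default = resolve_isolation("windows-10")
print(server_default, client_default)
```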
The Entire Spirit and Goal
If it were Docker’s idea from the beginning to construct two platforms that could run workloads identically on two operating systems, the project might never have been completed. From an architectural standpoint, it’s not much different from constructing a skyscraper on an island and an identically functioning one on an offshore platform.
Suffice it to say, under the hood, the two Dockers don’t work alike. What’s more, Microsoft has made significant changes to Windows Server to enable any kind of containerization to happen at all.
The crux of Microsoft’s changes has to do with networking. Whereas the virtual switch, or vSwitch, has played very little role in the admittedly Linux-leaning New Stack so far, and no role whatsoever in Docker for Linux, it is critical to how Windows perceives a virtual network. All Docker containers’ connections take place over an IP network; and in Linux, the model for those connections is, simply put, Linux Routing. Just as simply, there is no Linux Routing in Windows. So Microsoft’s and Docker’s engineers had to gear the Windows version of the container platform for the vSwitch.
Back in the era of Windows Vista and the “Longhorn” project, Microsoft’s Windows engineers created a concept called IP routing compartments. Its original intent was to guarantee isolation for routing within a virtual private network (VPN). Docker networking on Windows does not use a VPN; however, it does leverage the compartment concept. Hyper-V leveraged it first to enable multiple virtual servers to run on a physical server. Specifically, compartmentalization ensures that each virtual server is isolated from all the others.
If you recall how Linux containers evolved, the concept began with the cgroup, which led to isolated namespaces. Giving each of these spaces network isolation with designated IP addresses then led to containers as Linux developers understand them. With Windows, the evolution ran in the opposite direction, at least as Microsoft principal engineering lead Dinesh Govindasamy explained it last Tuesday at DockerCon, in the session with Brown. First, Hyper-V gave Windows engineers the network isolation component that any container would need. From there, these engineers had to build a way to share the kernel, to share resources over an IP network, and to incorporate subsystems that would, in turn, enable Linux containers to run in Windows.
So even though software-defined networking (SDN) has been a thing for quite a while, Windows Server will only now begin officially supporting overlay network mode — a critical feature for implementing the Docker Swarm orchestration engine — when the latest update patch ships in just days, announced Govindasamy.
“The Docker networking architecture is built upon this Container Networking Model,” he went on, presenting a diagram that looks like a stack of pancakes cooked for Piet Mondrian.
For this architecture to apply to analogous Windows components, there needed to be a way for an application to securely build out the network — something Windows didn’t really have.
So Govindasamy and his colleagues devised a component called Host Networking Service (HNS, the networking counterpart to HCS). Docker for Windows relies upon HNS to create vSwitches and virtual firewalls (what Windows calls WinNAT). HNS also establishes network endpoints, binds them to vSwitch ports, and applies policies to the endpoints, he explained to attendees.
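The sequence Govindasamy attributes to HNS (create a vSwitch, establish an endpoint, bind it to a vSwitch port, apply policies) can be sketched as a toy model. Everything named below, including `ToyNetworkService`, `VSwitch`, `Endpoint`, and the policy string, is hypothetical and stands in for the order of operations only. It is not the real HNS API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Endpoint:
    name: str
    port: Optional[int] = None                 # vSwitch port this endpoint is bound to
    policies: List[str] = field(default_factory=list)


@dataclass
class VSwitch:
    name: str
    ports: Dict[int, str] = field(default_factory=dict)  # port number -> endpoint name

    def bind(self, endpoint: Endpoint) -> int:
        # Hand out the next free port number and record the binding on both sides.
        port = len(self.ports) + 1
        self.ports[port] = endpoint.name
        endpoint.port = port
        return port


class ToyNetworkService:
    """Hypothetical stand-in for the steps HNS performs; not the real HNS API."""

    def __init__(self) -> None:
        self.switches: Dict[str, VSwitch] = {}

    def create_network(self, name: str) -> VSwitch:
        self.switches[name] = VSwitch(name)
        return self.switches[name]

    def attach(self, network: str, endpoint_name: str, policies: List[str]) -> Endpoint:
        switch = self.switches[network]
        endpoint = Endpoint(endpoint_name)     # 1. establish the endpoint
        switch.bind(endpoint)                  # 2. bind it to a vSwitch port
        endpoint.policies.extend(policies)     # 3. apply policies to the endpoint
        return endpoint


service = ToyNetworkService()
service.create_network("nat")
web = service.attach("nat", "web", ["allow-tcp-80"])
print(web)
```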
“The default network mode we have in Linux is bridge mode. And the default network mode we have in Windows is NAT mode,” he continued, illustrating with a very broad brush the fact that the two operating systems have entirely different networking priorities. In a sense, they have contrasting personalities; Linux wants to connect, while Windows wants to abstract.
“For NAT mode, we create an internal vSwitch,” explained Govindasamy. “[It’s] a private vSwitch with an addition of a gateway NIC [virtual network interface card] to the host partition. Then we create a NAT between the gateway NIC and the external NIC. So any containers you added to this NAT network should be able to talk to each other because of the vSwitch, and any traffic that’s going out of this NAT network will be NATted using WinNAT.”
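To make the NAT-mode description concrete, here is a simplified Python model of the behavior Govindasamy outlines: containers on the same internal vSwitch reach each other directly, while traffic leaving the network has its source address rewritten. The class `NatNetwork`, the subnet, and the external address (taken from a documentation-only range) are assumptions for illustration. Real NAT also translates ports and tracks connections, which this sketch omits.

```python
import ipaddress


class NatNetwork:
    """Toy model of a NAT-mode container network (illustrative, not WinNAT itself)."""

    def __init__(self, subnet: str, external_ip: str) -> None:
        self.subnet = ipaddress.ip_network(subnet)
        self.external_ip = ipaddress.ip_address(external_ip)
        hosts = iter(self.subnet.hosts())
        self.gateway = next(hosts)      # gateway NIC placed in the host partition
        self._free = hosts              # remaining addresses go to containers
        self.containers = {}

    def add_container(self, name: str):
        ip = next(self._free)
        self.containers[name] = ip
        return ip

    def route(self, src_name: str, dst: str):
        src = self.containers[src_name]
        dst_ip = ipaddress.ip_address(dst)
        if dst_ip in self.subnet:
            # Same internal vSwitch: delivered directly, source address unchanged.
            return (str(src), str(dst_ip), "switched")
        # Leaving the NAT network: source rewritten to the external NIC address.
        return (str(self.external_ip), str(dst_ip), "natted")


net = NatNetwork("172.20.0.0/24", "203.0.113.10")
net.add_container("web")
net.add_container("db")
internal = net.route("web", "172.20.0.3")
outbound = net.route("web", "8.8.8.8")
print(internal, outbound)
```

In this model the container-to-container packet keeps its private source address, while the outbound packet appears to come from the host's external address, matching the two cases in the quote.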
Why does all this matter? There are a few reasons, which I promise are interesting.
First, while it has been the goal of container orchestration to present consistent methodologies across platforms, the adaptations made to Docker to enable Windows container networking are certain to result in different performance profiles. Not necessarily worse, perhaps even better, but certainly different.
And that leads to the other key reason why this matters: While the original Linux Routing architecture for Docker could not be translated into Windows, theoretically, there’s nothing stopping Microsoft’s vSwitch-oriented methodology from, at some point, being translated into a Linux implementation, if only experimentally. Open vSwitch is no longer a VMware project; it’s now stewarded by the Linux Foundation, of which Microsoft is a member.
Furthermore, during Tuesday’s keynote session at DockerCon where he introduced the Moby Project, Docker Chief Technology Officer Solomon Hykes spoke of the need for standardizing Docker’s chassis — the way an automobile manufacturer does — in order to enable easier specialization for various platforms.
“Obviously they re-use the same individual parts — the same wheels, engines, etc.,” said Hykes, “but they also collaborate on common assemblies of these components. And that allows them to not duplicate effort… So we stole that idea, and we applied it to this engineering problem of ours. And the result is something that worked really well. We created within Docker a place where all of our teams could collaborate… on common assemblies.”
Networking is fundamental to the way Docker operates, so it’s impossible to imagine that Docker’s engineers have somehow avoided considering the possibility of adopting Microsoft’s vSwitch/vNIC/NAT model or perhaps applying some standardized form of the model across all its editions, including Linux, at some future date. Thus, the problem Microsoft solved for enabling Docker on Windows Server may very well have pointed the way toward a future for Docker everywhere else.
After all, an ecosystem abhors imbalance.
Feature image: A Roberval balance scale, taken by Nikodem Nijaki, licensed under Creative Commons 3.0.