The Power of Hypervisor-Based Containers

The modern trend towards cloud-native apps seems set to condemn hypervisors to a long, slow death. Paradoxically, it is the massive success of hypervisors and infrastructure-as-a-service over the last 15 years that enabled this trend. Hypervisors introduced the tools that allow sysadmins and developers to deploy one (virtual) server per application efficiently. It is common practice throughout the industry with web servers, middleware, databases — they all get their separate VMs.
Cloud native apps are just the natural step forward. Users stopped thinking about VMs and Linux distros (taking them for granted) and started focusing on what matters to them: cloud applications. Large-scale technology companies, like Google, quickly realized that, as the Linux kernel matured, containers could provide significant application deployment and performance benefits. Docker popularized this concept by providing the right tools, with the right command line interfaces, at the right time.
Docker’s greatest innovation was a simple framework that allows efficient application packaging, distribution and execution over a complex infrastructure of drivers, namespaces, and Linux kernel details that no developer wants to learn. All these capabilities have been part of the Linux infrastructure for a while, but there was no cohesive way of using them without deeply understanding all the details. Docker took something complex and abstracted it into something simple: an art in itself.
More importantly, Docker’s application packaging technology significantly simplified the dependency “hell” that continues to be a struggle for application developers. To some extent, it is the revenge of static binaries. Docker provided a simple mechanism to package all application dependencies in a single binary. Developers could stop worrying about library incompatibilities. Immutable infrastructure, the concept of replacing software with new images rather than doing upgrades, became viable. The final state of an application could always be well-defined, known and tested.
Docker uses Linux namespaces, also called containers, to execute applications packaged in the new format. In this model, all applications are run on the same Linux kernel, which is responsible for isolation and resource management. However, containers are only one way to run cloud native apps.
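To make that model concrete, here is a minimal sketch in C, assuming a Linux host and root privileges. It is not Docker’s code, just a bare clone() call that drops a shell into new PID, UTS and mount namespaces while sharing the host kernel; the hostname it sets is purely illustrative.

/* Minimal namespace sketch: run a shell in new PID, UTS and mount
 * namespaces while sharing the host kernel. Requires root (or the
 * relevant capabilities). Build: gcc -o ns_demo ns_demo.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

static int child_main(void *arg)
{
    (void)arg;
    /* Inside the new namespaces: this process sees itself as PID 1
     * and can change the hostname without affecting the host. */
    sethostname("container-demo", strlen("container-demo"));
    printf("child pid inside namespace: %d\n", getpid());
    execlp("/bin/sh", "sh", (char *)NULL);
    perror("execlp");
    return 1;
}

int main(void)
{
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
    pid_t pid = clone(child_main, child_stack + STACK_SIZE, flags, NULL);
    if (pid == -1) {
        perror("clone");
        exit(EXIT_FAILURE);
    }
    printf("child pid seen from host: %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}

Everything the child does still goes through the shared host kernel; the namespaces only change what it can see.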
Hypervisors in the Container Era
Intel’s Clear Containers, Hyper’s RunV and VMware vSphere Integrated Containers (VIC) are hypervisor-based implementations of the Open Container Initiative (OCI) Runtime spec. They achieve higher isolation while maintaining the benefits of application packaging and immutable infrastructure. These technologies are capable of executing cloud-native apps, using existing hypervisors like KVM, Xen, and ESXi. They create a fresh new virtual machine for each application instance.
Hypervisors exploit hardware features available on current processors to provide stronger isolation with less attack surface. Technologies like Intel VT-x/VT-d and AMD-V provide hardware support for robust workload separation. Moreover, since hypervisors tend to be smaller than operating systems, they are affected by fewer vulnerabilities.
By contrast, Linux containers expose the full Linux system call interface to every application: such a large attack surface is tough to secure. It can be done, but it is a challenging task, one that requires detailed knowledge of the application running inside, and it is never bulletproof. It was recently demonstrated how a common Linux kernel vulnerability, such as Dirty COW, can be used by a malicious app for privilege escalation from within a container.
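To give a sense of what that hardening work looks like in practice, here is a rough sketch using libseccomp (assuming the library is installed; this is my illustration, not code from any container runtime). The allowlist below is arbitrary: choosing the right one requires exactly the per-application knowledge mentioned above.

/* Sketch: shrink the syscall surface exposed to a containerized process
 * with a seccomp allowlist (libseccomp). Anything not listed below kills
 * the process, so the list must match what the application actually uses.
 * Build: gcc -o seccomp_demo seccomp_demo.c -lseccomp
 */
#include <seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Default action: kill the process on any syscall not allowed below. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL) {
        perror("seccomp_init");
        return 1;
    }

    /* Illustrative allowlist; a real profile is derived from the workload. */
    int allowed[] = { SCMP_SYS(read),  SCMP_SYS(write), SCMP_SYS(close),
                      SCMP_SYS(fstat), SCMP_SYS(brk),   SCMP_SYS(mmap),
                      SCMP_SYS(munmap), SCMP_SYS(rt_sigreturn),
                      SCMP_SYS(exit),  SCMP_SYS(exit_group) };
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0) < 0) {
            fprintf(stderr, "seccomp_rule_add failed\n");
            seccomp_release(ctx);
            return 1;
        }
    }

    if (seccomp_load(ctx) < 0) {   /* the filter takes effect here */
        fprintf(stderr, "seccomp_load failed\n");
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);

    write(STDOUT_FILENO, "still allowed\n", 14);
    /* Anything outside the list, e.g. openat(), would now terminate us. */
    return 0;
}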
The Challenge of Hypervisor-Based Containers
Hypervisors do have a cost: they introduce potential performance bottlenecks, especially on IO functions (network, storage), and have a larger memory footprint, since every container requires its own kernel instance. Additionally, hypervisor-based container deployments are often crippled by their inability to run on top of public clouds. For example, they cannot be used on top of Amazon AWS or Google Cloud Platform because they require hardware virtualization support, which is not available on public cloud virtual machines.
These trade-offs present us with a choice between convenience and flexibility on one side and isolation and security on the other. We should not have to make that choice. It is time to re-think some of the fundamental assumptions that are leading to this dilemma.
A Re-examination and a New Approach with Hypervisor-Based Containers
X86 hypervisors were built to meet one key requirement: running multiple operating systems on the same machine. To achieve this goal, they created a “hardware abstraction layer.” The guest operating system is often unable to tell the difference between running on bare metal and running in a virtual machine. Hardware interfaces are small and secure but introduce overhead. IO functions like TCP/IP stacks and storage stacks are duplicated inside the virtual machine and the underlying hypervisor, sometimes conflicting with one another.
If we relax this prime requirement, if we think about multiple cloud-native apps instead of multiple operating systems, we have the opportunity to revisit some of the core hypervisor design choices. We can make virtualization suitable for the container era.
Let us consider a hypervisor that does not present hardware abstractions to guests, but uses system call interfaces instead. Any time an application interacts with the kernel, it issues a system call. The guest kernel can serve the system call locally or proxy it to the underlying hypervisor for processing. With such an architecture, we don’t need to traverse the network stack twice, first in the guest, then in the host. The requests for network connectivity are directly passed to the host. At the same time, a memory allocation function or a privilege call can be confined within the boundaries of the guest operating system.
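A toy sketch may help illustrate the dispatch decision. The code below is purely hypothetical: proxy_to_host() and handle_in_guest() are made-up placeholders, not real Xen or Linux interfaces, and a real guest kernel would make this decision far more carefully.

/* Toy illustration of the dispatch described above: network IO goes
 * straight to the host stack, while memory and privilege operations
 * stay inside the guest kernel. The two back-ends are stubbed out so
 * the sketch compiles and runs on Linux x86_64.
 */
#include <stdio.h>
#include <sys/syscall.h>

static long proxy_to_host(long nr, const long args[6])
{
    (void)args;
    printf("syscall %ld -> proxied to the host kernel\n", nr);
    return 0;
}

static long handle_in_guest(long nr, const long args[6])
{
    (void)args;
    printf("syscall %ld -> handled by the guest kernel\n", nr);
    return 0;
}

static long guest_syscall_dispatch(long nr, const long args[6])
{
    switch (nr) {
    /* Networking: hand straight to the host, so the request traverses
     * only one network stack. */
    case SYS_socket:
    case SYS_connect:
    case SYS_sendmsg:
    case SYS_recvmsg:
        return proxy_to_host(nr, args);

    /* Memory management and privilege changes: confined to the guest. */
    case SYS_mmap:
    case SYS_brk:
    case SYS_setuid:
    default:
        return handle_in_guest(nr, args);
    }
}

int main(void)
{
    long args[6] = {0};
    guest_syscall_dispatch(SYS_connect, args); /* proxied to host  */
    guest_syscall_dispatch(SYS_brk, args);     /* stays in guest   */
    return 0;
}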
System calls are higher level and carry significantly more context than hardware devices. They are easier to virtualize and tend to be faster. Consider the network stack as an example. In a traditional hypervisor model, the hypervisor only sees IP packets, with minimal context on the process that sends them. In a system call virtualization model, the hypervisor has as much context about the connection as the guest kernel itself, so it can make more informed decisions. A small, carefully selected set of system calls is enough to provide virtual machines running cloud native applications with IO access, which is the largest source of virtualization overhead on modern hardware.
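The difference in available context can be sketched as two illustrative structures. Neither corresponds to real Xen or Linux data types; they simply contrast what a device-level backend receives with what a system-call-level backend could receive for the same outbound connection.

/* Illustrative only: what each model can see for one outbound request. */
#include <netinet/in.h>
#include <stddef.h>
#include <sys/types.h>

/* Traditional device-level model: the backend receives opaque frames. */
struct frame_view {
    const unsigned char *data;  /* raw Ethernet/IP bytes                 */
    size_t               len;   /* no notion of process or connection    */
};

/* Syscall-level model: the backend sees the connect() request itself. */
struct syscall_view {
    pid_t               guest_pid;   /* which process asked              */
    uid_t               guest_uid;   /* with which credentials           */
    int                 type;        /* stream or datagram socket        */
    struct sockaddr_in  destination; /* address and port, in the clear   */
};

int main(void) { return 0; }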
An attempt to achieve this type of abstraction is underway in the Xen Project community. It is referred to as “PV Calls” (PV stands for “paravirtualization”). This approach is already demonstrating performance up to four times faster than traditional Xen Project networking. It shows significant promise in evolving hypervisors into efficient container runtimes.
Of course, the last piece of the puzzle is the ability to run cloud native apps anywhere, including Amazon AWS and Google Cloud. Some hypervisors do have this ability. The Xen Project hypervisor provides two kinds of virtual machines: PV and fully virtualized machines (HVM). HVM virtual machines can exploit hardware virtualization support, and they are the most commonly deployed today. PV virtual machines are ancient by software standards. They predate virtualization support in hardware; they don’t use it or need it. As such, they can run on Amazon AWS and other public clouds.
Together, PV Calls and Xen Project PV virtual machines give us a comprehensive solution to deploy cloud native apps, on bare-metal as well as IaaS, with the security properties of VMs and the speed and agility of containers.
Many people believe that if a technology is more than five years old, it is outdated. By that logic, some may believe that the hypervisor is dated, but it still has a few aces up its sleeve. The hypervisor might look different in the future, but I wouldn’t count this technology out just yet.