Mizar: Scalable Multitenant Networking with XDP on Kubernetes
Futurewei sponsored this post.
Mizar is an open source project providing cloud networking to run virtual machines, containers, and other compute workloads. We built Mizar from the ground up with large scale and high performance in mind. Built in the same way as distributed systems in the cloud, Mizar utilizes XDP (eXpress Data Path) and Kubernetes to allow for the efficient creation of multitenant overlay networks with a massive number of endpoints. Each of these technologies brings valuable perks that enable Mizar to achieve its goals.
With XDP, Mizar is able to:
- Skip unnecessary stages of the network stack whenever possible and offload packet processing to smart NICs.
- Efficiently use kernel packet processing constructs without being locked into a specific processor architecture.
- Produce very small packet processing programs (<4KB).
With Kubernetes, Mizar is able to:
- Efficiently program the underlying core XDP programs.
- Manage the lifecycle of its abstractions via CRDs.
- Have a scalable and distributed management plane.
- Deploy its core components and modules across all specified hosts.
Mizar’s Goals and the Limitations of Existing Solutions
Current flow-based programming solutions are not scalable and have a multitude of issues and quirks. Using OVS/OVN as an example of a classical flow-based network programming solution, these are the limitations Mizar aims to overcome:
- Every host must be programmed each time an endpoint is provisioned.
- Packets traverse multiple network stacks on the same host.
- Scaling is difficult: creating 10,000 logical ports generates more than 40,000 port bindings.
- CPU utilization is generally high during flow programming.
- The time to provision a port grows linearly with the number of existing ports.
- Similarly, the time to provision a container depends on the number of existing containers.
Along with overcoming these issues, Mizar also has the following goals:
- Support the provisioning of many endpoints (up to 10 million).
- Achieve high network routing throughput and low latency.
- Create an extensible plugin framework for cloud networking.
- Unify the network data plane for VMs, containers, serverless, and other workloads.
Mizar’s Overall Organization
Akin to networking done in the cloud, Mizar allows networks to be partitioned via two new resources that we have introduced to Kubernetes: VPCs (Virtual Private Clouds) and, within them, Subnets. To route traffic between endpoints (Pods) under this organization scheme, we use XDP programs and in-network distributed hash tables to create the network functions known as Mizar’s Bouncers and Dividers. Bouncers operate within the subnet scope and are responsible for directing intra-network traffic. Similarly, Dividers operate at the VPC level and are responsible for inter-network traffic. The configuration of these Bouncers and Dividers also isolates Pod traffic, which allows for multitenancy and the reuse of the same address spaces across multiple tenants.
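To make the isolation idea concrete, here is a minimal Python sketch of how scoping lookups by VNI lets two tenants reuse the same address space. It is a toy model under our own assumptions, not Mizar's implementation: the `Bouncer` and `Divider` classes, their methods, and the use of plain dictionaries in place of in-network distributed hash tables are all hypothetical.

```python
from ipaddress import IPv4Address, IPv4Network
from typing import Dict, List, Optional, Tuple


class Bouncer:
    """Toy subnet-scoped table: (VNI, endpoint IP) -> host IP. Hypothetical model."""

    def __init__(self, vni: int, subnet: str) -> None:
        self.vni = vni
        self.subnet = IPv4Network(subnet)
        self.endpoints: Dict[Tuple[int, IPv4Address], str] = {}

    def add_endpoint(self, ip: str, host: str) -> None:
        addr = IPv4Address(ip)
        if addr not in self.subnet:
            raise ValueError(f"{ip} is outside {self.subnet}")
        # Keying on the VNI is what keeps tenants with identical IPs apart.
        self.endpoints[(self.vni, addr)] = host

    def lookup(self, vni: int, ip: str) -> Optional[str]:
        """Return the host of an endpoint, or None if this tenant does not own it."""
        return self.endpoints.get((vni, IPv4Address(ip)))


class Divider:
    """Toy VPC-scoped function: picks the Bouncer whose subnet holds the address."""

    def __init__(self, vni: int) -> None:
        self.vni = vni
        self.bouncers: List[Bouncer] = []

    def add_bouncer(self, bouncer: Bouncer) -> None:
        if bouncer.vni != self.vni:
            raise ValueError("Bouncer belongs to a different VPC")
        self.bouncers.append(bouncer)

    def route(self, vni: int, ip: str) -> Optional[str]:
        if vni != self.vni:
            return None  # traffic from another VPC is never resolved here
        addr = IPv4Address(ip)
        for bouncer in self.bouncers:
            if addr in bouncer.subnet:
                return bouncer.lookup(vni, ip)
        return None


if __name__ == "__main__":
    tenant_a = Bouncer(vni=1, subnet="10.0.0.0/24")
    tenant_b = Bouncer(vni=2, subnet="10.0.0.0/24")  # same CIDR, other tenant
    tenant_a.add_endpoint("10.0.0.5", "192.168.1.10")
    tenant_b.add_endpoint("10.0.0.5", "192.168.1.20")
    print(tenant_a.lookup(1, "10.0.0.5"), tenant_b.lookup(2, "10.0.0.5"))
```

In this model the same endpoint address resolves to a different host depending on the VNI, and a lookup with the wrong VNI returns nothing.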
Mizar’s overall architecture can be separated into two categories: a management plane and a data plane. The management plane and the data plane communicate via an RPC interface exposed by the node daemon. In the following sections, we will talk about the key features and functionalities of each of these.
The Mizar Management Plane
Mizar’s management plane utilizes Kubernetes’ Operator Framework, CRDs, and Luigi to manage and configure the underlying XDP programs/data plane.
We have introduced six CRDs, each with an operator that exposes an interface to manage its lifecycle. The lifecycle of each object or custom resource is governed by three workflows: create, update and delete. Each of these workflows is triggered by state changes in its respective object. For example, the droplet object, which represents a physical interface on a node, will trigger the management plane delete workflow if the physical interface itself is removed.
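The create, update and delete pattern can be sketched in a few lines of Python. This is an illustrative model only: the `Droplet` fields, the `DropletOperator` class and its `reconcile` method are names we made up for the example, and the real operators run inside Kubernetes rather than on an in-memory dictionary.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass
class Droplet:
    """Observed state of a physical interface (hypothetical field names)."""
    name: str
    ip: str
    mac: str
    present: bool = True


class DropletOperator:
    """Chooses a workflow by comparing observed state with the last known state."""

    def __init__(self) -> None:
        self.known: Dict[str, Droplet] = {}
        self.history: List[str] = []

    def reconcile(self, observed: Droplet) -> str:
        previous = self.known.get(observed.name)
        if not observed.present:
            # The interface disappeared: run delete if we were tracking it.
            workflow = "delete" if previous is not None else "noop"
            self.known.pop(observed.name, None)
        elif previous is None:
            workflow = "create"
            self.known[observed.name] = replace(observed)  # store a copy
        elif previous != observed:
            workflow = "update"
            self.known[observed.name] = replace(observed)
        else:
            workflow = "noop"
        self.history.append(workflow)
        return workflow
```

Feeding the operator a new interface, then the same interface with a changed IP, then the interface marked as absent walks through all three workflows in order.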
The CRDs introduced by Mizar’s Management Plane are:
- VPCs, which have information about:
  - Their own CIDR range.
  - Their VNI (Virtual Network Identifier).
  - The number of Dividers they contain.
- Networks, which have information about:
  - Their own CIDR range.
  - Their parent VPC’s VNI.
  - The number of Bouncers they contain.
- Dividers, which have information about:
  - The VPC they belong to.
  - Information about the interface that they have been provisioned on (IP, MAC, Name, etc.).
- Bouncers, which have information about:
  - The network they belong to.
  - Information about the interface they have been provisioned on (IP, MAC, Name, etc.).
- Endpoints, which have information about:
  - The details about their own network interface (IP, MAC, Name, etc.).
  - The Network they belong to.
  - The VPC they belong to.
  - The interface they have been provisioned on (IP, MAC, Name, etc.).
- Droplets, which have information about:
  - The network interface they represent (IP, MAC, Name, etc.).
The information described in each of these CRDs is primarily used to program the data plane modules, which we will go over in the next section.
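As a rough picture of the information carried by the VPC and Network resources listed above, the following Python sketch models them as plain data classes. The class names, field names and the `make_network` helper are hypothetical, and the rule that a Network's CIDR must fall inside its parent VPC's CIDR is our assumption for the example rather than something stated by the CRDs themselves.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network


@dataclass(frozen=True)
class Vpc:
    name: str
    cidr: str          # the VPC's own CIDR range
    vni: int           # Virtual Network Identifier
    dividers: int = 1  # number of Dividers the VPC contains


@dataclass(frozen=True)
class Network:
    name: str
    cidr: str          # the Network's own CIDR range
    vni: int           # the parent VPC's VNI
    bouncers: int = 1  # number of Bouncers the Network contains


def make_network(vpc: Vpc, name: str, cidr: str, bouncers: int = 1) -> Network:
    """Create a Network under a VPC, inheriting the VPC's VNI.

    Assumption for this sketch: the Network must be a subnet of the VPC CIDR.
    """
    if not IPv4Network(cidr).subnet_of(IPv4Network(vpc.cidr)):
        raise ValueError(f"{cidr} is not contained in VPC range {vpc.cidr}")
    return Network(name=name, cidr=cidr, vni=vpc.vni, bouncers=bouncers)
```

For instance, `make_network(Vpc("vpc0", "10.0.0.0/16", vni=7), "net0", "10.0.1.0/24")` yields a Network that carries VNI 7, while a CIDR outside `10.0.0.0/16` raises a `ValueError`.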
The Mizar Data Plane
The Mizar data plane consists of a set of XDP programs that are loaded onto network interfaces.
Alongside these XDP programs, we also run a daemon on each node to relay information between the data plane and the management plane. The details of each of these programs are as follows:
- Transit Daemon
  - This user-space process acts as the main relay between the management plane and the data plane. The daemon’s main function is to configure and program the Transit XDP and Transit Agent, based on the information passed down by the management plane. We also expose a CLI for communicating with the daemon directly, with its primary usage aimed at testing and troubleshooting.
- Transit XDP
  - This XDP program is loaded onto the interface of a host and thus processes all ingress traffic directed to that interface.
- Transit Agent
  - This XDP program is loaded onto the veth pair of each endpoint (Pod) in the root namespace, processing all egress traffic from the interface.
For ingress traffic to a host, the Transit XDP program determines whether it should drop a packet, or redirect it to an endpoint (Pod) or user-space network function.
After decapsulation, packets are redirected to the TX queue of an endpoint’s veth pair in order to reach the Pod. This process bypasses the host network stack in the root namespace entirely. The redirect destination is determined by the Transit XDP program, based on configuration previously pushed down from the management plane into the data plane via eBPF maps. If a packet is destined for a network function, we may instead invoke another XDP program via a tail call, or redirect the packet to an AF_XDP socket.
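The ingress decision can be modeled in plain Python to show its shape. This is not XDP code and not Mizar's logic: the `Action` names, the `(VNI, destination IP)` key, and the two dictionaries standing in for eBPF maps are assumptions made for the illustration.

```python
from enum import Enum
from typing import Dict, Optional, Tuple, Union

Key = Tuple[int, str]  # (VNI, inner destination IP), a hypothetical map key


class Action(Enum):
    DROP = "drop"
    REDIRECT_VETH = "redirect_veth"      # hand the packet to an endpoint's veth
    TAIL_CALL = "tail_call"              # invoke another XDP program
    REDIRECT_AF_XDP = "redirect_af_xdp"  # hand the packet to a user-space socket


def transit_xdp_ingress(
    vni: int,
    dst_ip: str,
    endpoint_map: Dict[Key, str],
    nf_map: Dict[Key, Tuple[Action, int]],
) -> Tuple[Action, Optional[Union[str, int]]]:
    """Return the action for a decapsulated packet and its target.

    endpoint_map stands in for state pushed down by the management plane;
    nf_map lists network functions with either a tail-call program index
    or an AF_XDP queue index.
    """
    key = (vni, dst_ip)
    veth = endpoint_map.get(key)
    if veth is not None:
        return Action.REDIRECT_VETH, veth
    network_function = nf_map.get(key)
    if network_function is not None:
        return network_function
    # Nothing on this host owns the destination, so the packet is dropped.
    return Action.DROP, None
```

A packet whose key matches a local endpoint is redirected to that endpoint's veth, a key registered as a network function yields a tail call or an AF_XDP redirect, and anything else is dropped.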
Similarly, the Transit Agent processes and encapsulates all egress packets from an endpoint. After encapsulation, TX packets destined for another host are typically redirected to the main interface of the destination endpoint’s host, or of a Bouncer’s host. If the destination endpoint is on the same host, the Transit Agent invokes the Transit XDP program via a tail call, which then redirects the packet to the veth interface of the cohosted destination endpoint.
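The egress path can be sketched the same way. Again this is a simplified Python model with hypothetical names. In particular, the rule used here to choose between the destination endpoint's host and a Bouncer's host (go direct when the host is already known, otherwise go through the Bouncer) is an assumption for the example, and the `Encapsulated` tuple only stands in for a real encapsulation header.

```python
from typing import Dict, NamedTuple, Tuple


class Encapsulated(NamedTuple):
    action: str     # what the agent does with the packet
    outer_dst: str  # host the encapsulated packet is sent to
    vni: int        # tenant identifier carried in the outer header
    inner_dst: str  # original destination endpoint IP


def transit_agent_egress(
    local_host: str,
    vni: int,
    dst_ip: str,
    endpoint_hosts: Dict[Tuple[int, str], str],
    bouncer_host: str,
) -> Encapsulated:
    """Pick the next hop for a packet leaving an endpoint (toy model)."""
    host = endpoint_hosts.get((vni, dst_ip))
    if host == local_host:
        # Cohosted destination: hand over to Transit XDP with a tail call.
        return Encapsulated("tail_call_transit_xdp", local_host, vni, dst_ip)
    if host is not None:
        # Assumed rule: a known remote host is reached directly.
        return Encapsulated("redirect_to_endpoint_host", host, vni, dst_ip)
    # Assumed rule: unknown destinations go through the subnet's Bouncer.
    return Encapsulated("redirect_to_bouncer_host", bouncer_host, vni, dst_ip)
```

The three return paths correspond to the three cases described above: a cohosted endpoint, a destination endpoint's host, and a Bouncer's host.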
Mizar’s current and future features are still under active development by both current team members and external university collaborators. Moreover, there are still many details and features that were not covered in this post. For more information about Mizar and to test it out yourself, please drop by our GitHub repo.
Mizar has come a long way since its inception in 2019. As of this post, the Mizar repo has garnered 300+ commits, and Mizar, alongside Arktos, made its debut in a talk at KubeCon 2020. Thank you to all contributors and collaborators for making Mizar what it is today.