Linux cgroups v2 Brings Rootless Containers, Superior Memory Management
Containers and container management tools have a lot of moving parts. Although you can deploy a single Docker container quickly and without much thought, the more you scale that deployment and the more services you add to it, the more complicated it becomes. Kubernetes deployments in particular can quickly become complex, and they can also become very demanding on resources.
One of those moving parts is cgroups. Originally created by Google and merged into Linux kernel 2.6.24, a cgroup (short for “control group”) is a means of managing how many computational resources a set of processes (i.e., a container) may use. With cgroups you can do things like isolate core workloads from background tasks, prevent one workload from overpowering other workloads, and much more.
Until recently, container developers used cgroups v1. However, cgroups v2, which has been part of the kernel since version 4.5, is now supported by most container deployment systems. This new version includes a number of important changes that container developers will want to know about.
The biggest change in cgroups v2 is a simplified hierarchy. Where v1 used an independent tree for each controller (such as /sys/fs/cgroup/cpu/GROUPNAME and /sys/fs/cgroup/memory/GROUPNAME), v2 unifies them under /sys/fs/cgroup/GROUPNAME. In the same vein, if Process X joins /sys/fs/cgroup/test, every controller enabled for test will control Process X.
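One way to see the difference between the two layouts is to look at /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. Under v1 a process has one line per controller tree; under v2 it has a single line with ID 0 and an empty controller list. The following is a minimal Python sketch of that check, not a definitive tool. The function names and the sample text are illustrative assumptions, not part of any library.

```python
def cgroup_paths(proc_cgroup_text):
    """Map each controller list to its cgroup path.

    Input is the text of /proc/<pid>/cgroup, where every line looks like
    "hierarchy-ID:controller-list:cgroup-path". The v2 unified hierarchy
    always has ID 0 and an empty controller list.
    """
    paths = {}
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        # The path itself may contain ":", so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            key = "unified"
        else:
            key = controllers
        paths[key] = path
    return paths


def is_pure_v2(proc_cgroup_text):
    """True when the process belongs only to the unified (v2) hierarchy."""
    return list(cgroup_paths(proc_cgroup_text)) == ["unified"]


if __name__ == "__main__":
    # Hypothetical sample contents, for illustration only.
    v1_sample = "12:memory:/test\n11:cpu,cpuacct:/test\n"
    v2_sample = "0::/test\n"
    print(cgroup_paths(v1_sample), is_pure_v2(v1_sample))
    print(cgroup_paths(v2_sample), is_pure_v2(v2_sample))
```

On a real host, passing the contents of /proc/self/cgroup to is_pure_v2 gives a quick indication of whether the system is running the unified hierarchy only.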
For example, in cgroups v2, memory protection and memory limits are configured in four files:
- memory.min: hard protection; memory within this amount will never be reclaimed.
- memory.low: best-effort protection; memory below this threshold is reclaimed only if there’s no reclaimable memory left in unprotected cgroups.
- memory.high: a throttle limit; the kernel throttles the cgroup's processes and reclaims aggressively to keep memory usage below this value.
- memory.max: the hard limit; if memory usage reaches this level and cannot be reduced, the OOM killer (a system used to sacrifice one or more processes to free up memory for the system when all else fails) is invoked on the cgroup.
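Because each of these settings is a plain file inside the cgroup's directory, configuring them amounts to writing a byte count (or the literal string max, meaning no limit) into the file. The sketch below shows that idea in Python. The helper name set_memory_limits and the example values are assumptions made for illustration, and writing to a real cgroup directory requires appropriate privileges.

```python
from pathlib import Path

# The four memory interface files described above.
MEMORY_SETTINGS = ("min", "low", "high", "max")


def set_memory_limits(cgroup_dir, limits):
    """Write memory settings into a cgroup v2 directory.

    `limits` maps a setting name ("min", "low", "high", "max") to either
    a non-negative byte count or the string "max" (no limit).
    """
    cgroup_dir = Path(cgroup_dir)
    for name, value in limits.items():
        if name not in MEMORY_SETTINGS:
            raise ValueError(f"unknown memory setting: {name!r}")
        if value != "max" and (not isinstance(value, int) or value < 0):
            raise ValueError(f"invalid value for memory.{name}: {value!r}")
        # Each setting lives in its own file, e.g. <cgroup_dir>/memory.high
        (cgroup_dir / f"memory.{name}").write_text(f"{value}\n")
```

A call such as set_memory_limits("/sys/fs/cgroup/test", {"low": 256 * 1024**2, "high": 768 * 1024**2, "max": 1024**3}) would protect 256 MiB on a best-effort basis, begin throttling at 768 MiB, and cap the group at 1 GiB. These numbers are placeholders, not recommendations.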
Rootless containers have become a very popular means of limiting the damage from runtime vulnerabilities in containers. Why rootless containers? With this added security layer, if a container is compromised, the attacker won’t be able to gain root privileges on the host. Rootless containers also allow isolation between nested containers. The problem to date has been that cgroups v1 did not support imposing resource limits on rootless containers. That changes with cgroups v2, under which rootless containers can have resource limits applied.
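In practice, a rootless container can only be limited by the controllers that are available in the cgroup its unprivileged user owns. In cgroups v2, every cgroup directory has a cgroup.controllers file listing the available controllers as space-separated names. The Python sketch below checks such a listing for the controllers a runtime might want. The function name, the default list of wanted controllers, and the assumption that the user's cgroup has been delegated (commonly handled by the init system) are illustrative and not stated in the article.

```python
def missing_controllers(controllers_text, wanted=("cpu", "memory", "pids")):
    """Return the wanted controllers that are absent from a cgroup.

    `controllers_text` is the content of a cgroup.controllers file:
    controller names separated by whitespace, e.g. "cpu io memory pids".
    """
    available = set(controllers_text.split())
    return [name for name in wanted if name not in available]


if __name__ == "__main__":
    # Hypothetical file content for a cgroup where only memory and pids
    # are available to the unprivileged user.
    sample = "memory pids\n"
    print(missing_controllers(sample))
```

An empty result suggests that limits for all of the wanted controllers can be applied in that cgroup; any names returned would need to be made available by an administrator first.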
Other changes found in cgroups v2 include:
- Cgroup controllers now negotiate with subsystems before problems can actually occur. Those subsystems are also capable of taking action to remediate the problems.
- Global inotify support.
- A single unified hierarchy means controllers no longer need to be kept in sync across separate trees.
- More upfront design.
- Universal thresholds.
It’s important to know that most high-level container runtimes and orchestrators (containerd, Docker, Podman, and Kubernetes) are now capable of fully supporting cgroups v2. Most of this support has arrived since Nov. 2019, and with cgroups v1 being deprecated, it’s time to start making the challenging migration from v1 to v2.