Despite assurances from the purveyors of “frothy container orchestration systems,” persistent storage will remain a roadblock for containers for the foreseeable future, argued James Bottomley, an IBM container evangelist, speaking at the recent Linux Foundation Vault storage conference held in Boston last month.
Right now, a major roadblock to stateful storage of containers is the inability, under current Linux-y architectures, to reconcile the file system user ID (fsuid), used by external storage systems, with the user IDs (uids) created within containers. They can not be reconciled in any way that can be both safe and maintainable without loss of coherence of either the system or the system administrator.
The Linux kernel community has lent a helping hand, through the creation of a superblock user namespace, though the superblock user namespace does not provide the full solution, and the container companies still must do a considerable amount of work to lock themselves into a solid Unix-based security model.
The Name Game
The way Linux user namespaces work here is key. It is received wisdom that the container security relies on the user namespaces, but this is a partial truth, Bottomley asserts. Security systems by default are supposed to lock someone out of a system. Namespaces actually do the opposite: They give enhanced privileges to users.
Certain root operations are permitted to the user within the user namespace that would not be permitted outside the user namespace. For instance, you can run the mount command from within a namespace, but not as a regular user.
A user namespace basically does a mapping from a set of exterior uids to interior uids. The exterior uid is what the kernel sees for the user. The interior uid is what the user inside the container sees.
Attempting to fit the two worlds together has led to some twisted usage, in Bottomley’s view. Container orchestration systems must be run, mostly, in root (this is why Bottomley characterizes most of the orchestration systems as “toxic”).
“A lot of bugs and problems we’ve had with user namespace have been through this enhancement of use,” Bottomley said. Privilege escalation vulnerabilities happen, for instance, because some networking utility functions are in the namespace. (He also did not have kind words for Docker’s implementation of the user namespaces, which wasn’t added until Docker 1.10, and was done so “poorly,” in his estimation.)
So by default, user namespace operations are privileged operations. Running unprivileged containers is a goal in the industry, though especially with cloud deployments, “we are way, way, away from being there yet,” Bottomley said.
One of the significant problems that writing containers out to storage is that storage systems have their own uids, the above-mentioned fsuid, which corresponds mostly to the kernel uid assignments to its users. It is the kernel that translates the uids to fsuids. Root inside a container will have a uid of 0, though the kernel will translate that to the container owner’s uid, say 1000, and that uid 1000 will be what the fsuid records as well, which everyone on that system will see for that user (except for the actual user, which will still see uid 0).
So far so good. But sharing containers presents the true dilemma here. And sharing, with services like Docker Hub, is much of the value proposition of containers. “This is a significant problem for running unprivileged containers alongside standard images,” Bottomley said. “If we’ve all written out container images for different values of root, it’ll be a horrible nasty mess somewhere.”
Human sacrifice, dogs and cats living together, mass hysteria, as Bill Murray might say.
Any operating system planted in a container needs to have root at uid 0 to operate. But under the current, unaltered setup, any operation within the container that touches the file system in any manner will get a permission-denied error. “I’ll see all these IDs on this image as belonging to nobody because I don’t have a mapping for them,” he said.
So! Containers need privileges. Yet, giving containers real root — not username root — gives enterprises the heebie-jeebies, because if someone can break out of the container they would have full root of the host. Or even if, due to some misconfiguration, they only have access to a certain external directory, they can still do root-level damage to everything in that directory, delete files and so on.
Ideally, we would need a way to transfer the userspace uid info between the kernel and the file system, eliminating the risks that come along with handing out root to containers willy-nilly.
Released in October, Linux 4.8 came with one potential solution, the inclusion of super block user namespace (often known by its variable name: s_user_ns), created by Red Hat software engineer Eric Biederman. The s_user_ns retains the uids from the container user namespace into the fuid space. “This gives me the capability of safely executing files and modifying files on privileged image,” Bottomley said.
And yet, s_user_ns has a pretty severe limitation, namely that it can only be used on superblocks, the basic unit of allocation from the file system, Bottomley explained. Containers, to work their magic, need to be bind-mounted, which replicates a directory tree. Bind-mounting, however, can’t typically be done with super block allocation.
Bottomley’s own hack to make this work is ShiftFS, which, in a nutshell, would allow bind-mounting to super blocks. If you don’t have to share images, then you can use your own permissions, but an approach like ShiftFS would be needed for “if I want to create a file system image I want to share with other people.”
There is a security danger with this approach. With the ability to write any file system as root, this mechanism must be tightly controlled, work for which must still be completed. “But hey, you all trust your container orchestration people, right? What could go wrong?” Bottomley said.
Feature image via Pixabay.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.