Recently, The New Stack published an article titled “Containers and Storage: Why We Aren’t There Yet” covering a talk from IBM’s James Bottomley at the Linux Foundation’s Vault conference in March. Both the talk and the article focused on one of the central problems we’ve been working to address in the Cloud Foundry Foundation’s Diego Persistence project team, so we thought it would be a good idea to highlight the features we’ve added to mitigate it. Cloud Foundry also does significantly better on the container security front than what the article suggests is the current state of the art, so we’ll cover that here as well.
UID to FSUID Mapping
As the article puts it:
Right now, a major roadblock to stateful storage of containers is the inability, under current Linux-y architectures, to reconcile the file system user ID (fsuid), used by external storage systems, with the user IDs (uids) created within containers. They can not be reconciled in any way that can be both safe and maintainable without loss of coherence of either the system or the system administrator.
We ran headlong into this issue in mid-2016 as we started to consider customer use cases for network-attached storage in Cloud Foundry. The problem is arguably even more acute for Cloud Foundry applications than for applications running in other container orchestrators, because Cloud Foundry is opinionated about the way applications run: it runs all buildpack applications as UID 2000. If we wanted to mount storage into the container using regular Linux kernel mounts, that would mean:
- Any attached file system would need to have access opened to UID 2000.
- Files created by any application deployed to Cloud Foundry would appear to the filesystem as owned by UID 2000, with no way to distinguish after the fact which app did what.
We looked at a couple of different ways to solve the issue (including a brief moment of lamentation that ShiftFS wasn’t available in the kernel yet). Cloud Foundry applications run in a user namespace, so UID mapping is available, just as it is in Docker. The trouble is that we only get one mapping per container runtime, so while we could easily remap UID 2000 on the container to some other UID on the host, we’d just end up with all the apps connecting as that new UID. What we needed was a way to effect a per-application UID mapping.
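To make the limitation concrete, here is a minimal sketch (illustrative only, not Cloud Foundry source) of how a single user-namespace `uid_map` entry translates UIDs. Because every buildpack app runs as UID 2000 inside its container, one shared mapping makes them all surface as the same host UID:

```python
def host_uid(container_uid, map_start, host_start, count):
    """Translate a container UID to a host UID under a single uid_map entry
    of the form "<start-in-container> <start-on-host> <count>"."""
    if not (map_start <= container_uid < map_start + count):
        raise ValueError("UID not covered by this mapping")
    return host_start + (container_uid - map_start)

# Two different apps, both running as UID 2000, under the same mapping
# "0 100000 65536" -- they collapse to the same host UID (102000) and
# become indistinguishable to an attached filesystem.
uids = {app: host_uid(2000, 0, 100000, 65536) for app in ("app-a", "app-b")}
```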
We eventually came up with a solution: inject UID mapping into the NFS mount itself, using a FUSE mount backed by an NFS client. Since the Network File System (NFS) is just a Remote Procedure Call (RPC) protocol, it is relatively straightforward to insert mapping logic between the FUSE interface and the NFS client calls. This allows an application to declare a UID that is meaningful to the remote NFS server: traffic to the server appears to come from that UID, while the same files present to the container and the container host as the standard UID 2000 that we use to run container applications.
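The core of the idea can be sketched as a pair of rewrites in the FUSE shim (the function names here are illustrative, not the actual NFS volume driver API): ownership is translated in both directions, so the NFS server sees the UID the app declared at bind time while the container keeps seeing UID 2000.

```python
CONTAINER_UID = 2000  # the fixed UID Cloud Foundry runs buildpack apps as

def to_server(uid, app_uid):
    """Outbound: requests made as the container UID carry the app's declared NFS UID."""
    return app_uid if uid == CONTAINER_UID else uid

def to_container(uid, app_uid):
    """Inbound: files owned by the app's NFS UID present as the container UID."""
    return CONTAINER_UID if uid == app_uid else uid

# Round trip: a file created by the app is owned by UID 5001 on the
# server, but still looks like UID 2000 from inside the container.
assert to_container(to_server(2000, 5001), 5001) == 2000
```

Because each application declares its own server-side UID, the per-application mapping happens in the mount rather than in the (single) user-namespace mapping.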
In the most recent release of our NFS plugins, we also support optional Lightweight Directory Access Protocol (LDAP) server configuration, so customers can use Active Directory Federation Services (ADFS) to resolve login credentials into UIDs, closely mirroring what they do when using NFS shares from non-Cloud Foundry applications.
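The shape of that credential-resolution step looks roughly like the following sketch. This is illustrative only: a real deployment queries the configured LDAP/ADFS server, and the dict here is a stand-in directory so the example stays self-contained.

```python
# Stand-in for an LDAP directory entry (real lookups go over the wire).
DIRECTORY = {"svc-reports": {"password": "s3cret", "uidNumber": 5001}}

def resolve_uid(username, password):
    """Resolve service-binding credentials to a POSIX UID, as the LDAP
    lookup would, before the NFS mount is created."""
    entry = DIRECTORY.get(username)
    if entry is None or entry["password"] != password:
        raise PermissionError("bad credentials")
    return entry["uidNumber"]
```

The resolved UID is then what the FUSE layer presents to the NFS server for that application's traffic.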
To be sure, this solution only partially solves the broader issue of UID mapping for container-based applications. Today it only works for NFS shares and, while it allows us to provide persistent storage for containers, it doesn’t do anything to provide persistent storage of containers. But in the Cloud Foundry context, container state is expected to be ephemeral, so storage of container state is an anti-pattern.
The article also highlights some potential security issues when running applications in containers, particularly when applications want to attach external storage, since mount operations require CAP_SYS_ADMIN:
The way Linux user namespaces work here is key. It is received wisdom that the container security relies on the user namespaces, but this is a partial truth, Bottomley asserts. Security systems by default are supposed to lock someone out of a system. Namespaces actually do the opposite: They give enhanced privileges to users.
Certain root operations are permitted to the user within the user namespace that would not be permitted outside the user namespace. For instance, you can run the mount command from within a namespace, but not as a regular user.
But in Cloud Foundry, container applications seldom run as root. As mentioned above, buildpack applications (the standard approach to deploying applications to Cloud Foundry) always run as an unprivileged user with UID 2000. It is possible to run Docker applications in Cloud Foundry, in which case the default behavior is to run as root, but even then the CAP_SYS_ADMIN capability is used only up front by the Garden runtime to bind storage and networks into the container, and is revoked before the application process is started.
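That “use, then revoke” pattern can be sketched with the capability bounding set via `prctl(2)`. This is not Garden’s implementation (Garden is written in Go and manages a fuller capability set); it only shows that CAP_SYS_ADMIN can be checked and irreversibly dropped before control is handed to the application process. Dropping from the bounding set requires CAP_SETPCAP, so the revoke will fail with EPERM for an ordinary user.

```python
import ctypes
import ctypes.util

# Linux prctl(2) constants for the capability bounding set.
PR_CAPBSET_READ = 23
PR_CAPBSET_DROP = 24
CAP_SYS_ADMIN = 21

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def cap_sys_admin_bounded():
    """True if CAP_SYS_ADMIN is still in this process's bounding set."""
    return libc.prctl(PR_CAPBSET_READ, CAP_SYS_ADMIN, 0, 0, 0) == 1

def revoke_cap_sys_admin():
    """Drop CAP_SYS_ADMIN from the bounding set (needs CAP_SETPCAP).

    Once dropped, neither this process nor any child can regain it --
    which is the property Garden relies on before starting the app."""
    return libc.prctl(PR_CAPBSET_DROP, CAP_SYS_ADMIN, 0, 0, 0) == 0
```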
Differences in Perspective
It’s worth noting that our goals for containers in Cloud Foundry are quite different from Bottomley’s original use case when he devised ShiftFS. The limitations in container technology that Bottomley calls out in his talk are very real, and for the most part, we have not solved them.
But arguably, we are less affected by them because our use of containers is more focused. Most of the drawbacks in today’s container technologies come from running processes in containers that want to do root-like things such as attaching networks, mounting devices and the like. Allowing root-like things to occur inside a container while simultaneously keeping that privilege contained inside the container is much harder than using containers simply to enforce isolation and consistency of interface.
Since Cloud Foundry manages and provides volume mounts and network interfaces outside of the context of container-based applications, it’s less susceptible to the security pitfalls inherent to running container applications with privilege. And as noted earlier, achieving usable external storage for container-based applications is much easier than providing storage for the containers themselves.
Whether or not we’re there yet depends on where we were trying to go.
The Cloud Foundry Foundation is a sponsor of The New Stack.