Google Investigates a New Approach for Workload Isolation

There’s a delicate balance between isolating workloads based on their security requirements and optimizing for compute and resource efficiency.
Machine isolation is an obvious solution, but it has its limitations. Google Senior Staff Reliability Engineer Michal Czapiński and Google Site Reliability Engineering Manager Rainer Wolafka are investigating ways to overcome “the limitations of machine isolation.” In a report presented at USENIX, they introduce a new isolation method they call “Workload Security Rings.”
Workload Security Rings (WSR) classifies workloads by security requirements, then isolates and enforces each class at the machine boundary. The methodology still keeps sensitive and untrusted workloads on separate machines but introduces a new mid-level class between the two. Sensitive data remains safe from hardware and software exploits such as zero-day and DDoS attacks, while resource utilization improves.
Czapiński and Wolafka developed their approach in Google’s production environment, but said “we believe this general technique will be applicable to other contexts such as Kubernetes.”
Czapiński and Wolafka are confident that Workload Security Rings resolve the tradeoff between compute efficiency and security. Additional scheduling constraints group workloads with similar security requirements into rings, preventing them from being co-scheduled with jobs at different levels of clearance.
What are Workload Security Rings?
In the simplest case, there are three classes of workloads:
- Sensitive Workloads handle mission-critical or sensitive information. The classification is subjective: what counts as sensitive ranges from fairly general to highly specific from one organization to the next. Of the hardening solutions mentioned below, a technique beyond sandboxing, called Binary Authorization, must be in place for the data to stay safe and the workload to run correctly. Sensitive workloads can only run on trusted machines.
- Hardened Workloads include trusted data but not sensitive data. The rest of the hardened classification concerns security controls put in place to prevent lateral movement within the cluster, including binary authorization, sandboxing, and other approaches. The primary concern with a hardened workload is its effect on other workloads. Hardened workloads are the “middle class” that can run on both trusted and untrusted machines. Sensitive workloads can start new jobs on other machines, but hardened jobs can’t.
- Unhardened Workloads are everything else, including jobs running untrusted code. These workloads can only run on untrusted machines. (The placement rules for all three classes are sketched in code after this list.)
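Taken together, the three classes reduce to a small placement-compatibility table. Below is a minimal Python sketch of those rules; the enum names and helper function are illustrative, not taken from the paper.

```python
from enum import Enum

class WorkloadClass(Enum):
    SENSITIVE = "sensitive"    # trusted machines only
    HARDENED = "hardened"      # trusted or untrusted machines
    UNHARDENED = "unhardened"  # untrusted machines only

class MachinePool(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"

# Which machine pools each workload class may be scheduled onto.
ALLOWED_POOLS = {
    WorkloadClass.SENSITIVE: {MachinePool.TRUSTED},
    WorkloadClass.HARDENED: {MachinePool.TRUSTED, MachinePool.UNTRUSTED},
    WorkloadClass.UNHARDENED: {MachinePool.UNTRUSTED},
}

def can_schedule(workload: WorkloadClass, pool: MachinePool) -> bool:
    """Return True if the workload class may run on the given machine pool."""
    return pool in ALLOWED_POOLS[workload]

# The article says sensitive jobs may start new jobs on other machines while
# hardened jobs may not; it is silent on unhardened jobs, so only the two
# cases it mentions are modeled here.
MAY_CREATE_REMOTE_JOBS = {
    WorkloadClass.SENSITIVE: True,
    WorkloadClass.HARDENED: False,
}

assert can_schedule(WorkloadClass.SENSITIVE, MachinePool.TRUSTED)
assert not can_schedule(WorkloadClass.SENSITIVE, MachinePool.UNTRUSTED)
assert can_schedule(WorkloadClass.HARDENED, MachinePool.UNTRUSTED)
assert not can_schedule(WorkloadClass.UNHARDENED, MachinePool.TRUSTED)
```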
The hardened workloads fill in the resource-utilization gaps created by the scheduling constraints between the sensitive and unhardened jobs. The larger the hardened class, the more resource fluctuation can be absorbed without swapping any machines from trusted to untrusted or vice versa.
As long as the hardened footprint is large enough, more workload classes can be added as necessary. Each new class needs its own group of dedicated machines, so the hardened class must stay appropriately sized to keep absorbing fluctuations and using resources effectively.
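As a rough illustration of that buffering effect, here is a back-of-the-envelope sketch; the function and the numbers are invented for this article, not taken from the paper. The idea is that hardened jobs running on trusted machines can be evicted and rescheduled onto untrusted machines before any machine has to be converted between pools.

```python
def conversions_needed(spike: int, idle: int, hardened_buffer: int) -> int:
    """Trusted machines that must be converted from the untrusted pool to
    serve a spike of `spike` machines' worth of new sensitive demand.

    Idle trusted machines absorb the spike first; then hardened jobs are
    evicted and rescheduled onto untrusted machines; only the remainder
    forces machines to change pools.
    """
    return max(0, spike - idle - hardened_buffer)

# With 5 idle trusted machines and 25 machines' worth of evictable hardened
# work, a 20-machine sensitive spike is fully absorbed:
print(conversions_needed(spike=20, idle=5, hardened_buffer=25))  # -> 0
# A 40-machine spike exceeds the buffer and forces 10 conversions:
print(conversions_needed(spike=40, idle=5, hardened_buffer=25))  # -> 10
```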
Czapiński and Wolafka are confident that WSR’s security “gives a strong guarantee that we will never co-schedule sensitive workloads with ones that are untrusted.” Though hardened workloads are potentially at risk, the ban on remote job creation makes it “prohibitively difficult” for an attacker to move across machines into the trusted pool.
Challenges
This isn’t a plug-and-play system, and it does require maintenance, the two warn. To avoid having to migrate machines from one group to another on demand, Czapiński and Wolafka suggest automatic rebalancing on a weekly cadence, so the sizing accounts for a full seven-day load cycle.
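The report doesn’t spell out what that weekly pass looks like in code; the following is a minimal sketch under the assumption that peak demand per ring is tracked over a seven-day window. The RingDemand type, the headroom parameter, and the numbers are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class RingDemand:
    """Peak machine demand observed for a ring over the past seven days."""
    name: str
    peak_machines: int

def rebalance(total_machines: int, sensitive: RingDemand,
              unhardened: RingDemand, headroom: float = 0.1) -> dict:
    """Size the trusted/untrusted pools from a week of observed peaks.

    Trusted machines must cover peak sensitive demand (plus headroom);
    untrusted machines must cover peak unhardened demand; hardened work
    flexes into whatever capacity is left on either side.
    """
    trusted = min(total_machines,
                  int(sensitive.peak_machines * (1 + headroom)))
    untrusted = total_machines - trusted
    if untrusted < unhardened.peak_machines:
        raise RuntimeError("fleet too small for observed unhardened peak")
    return {"trusted": trusted, "untrusted": untrusted}

# Run once a week so the sizing reflects a full seven-day demand cycle.
pools = rebalance(total_machines=1000,
                  sensitive=RingDemand("sensitive", 400),
                  unhardened=RingDemand("unhardened", 300))
print(pools)  # -> {'trusted': 440, 'untrusted': 560}
```

Note that neither pool is sized for hardened work at all: in this sketch, as in the article’s description, the hardened class simply occupies whatever the other two rings leave free.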
There is one exception to the security guarantee mentioned earlier: a sudden load spike. That is the one situation that could lead to a temporary lift of the scheduling constraints to prevent or mitigate an outage. Doing so increases the risk of lateral movement between rings and “is not a decision to be taken lightly,” the duo writes.