7 Steps to Highly Effective Kubernetes Policies
You just started a new job where, for the first time, you have some responsibility for operating and managing a Kubernetes infrastructure. You’re excited about wading even deeper into cloud native, but also terribly worried.
Yes, you’re concerned about the best way to write secure applications that follow best practices for naming and resource usage control, but what about everything else that’s already deployed to production? You spin up a new tool to peek into what’s happening and find 90 CVEs and YAML misconfiguration issues of high or critical importance. You close the tab and tell yourself you’ll deal with all of that … later.
Maybe the most ambitious and fearless of you will, but the problem is that while the cloud native community likes to talk about security, standardization and “shift left” a lot, none of these conversations deaden the feeling of being overwhelmed by security, resource, syntax and tooling issues. No development paradigm or tool seems to have discovered the right way to present developers and operators with the “sweet spot” of making misconfigurations visible without also overwhelming them.
As with any to-do list, whether it’s for work or household chores, our minds can effectively deal with only so many issues at a time. Too many issues and we get lost in context switching and prioritizing half-baked Band-Aids over lasting improvements. We need better ways to limit scope (aka triage), set milestones and finally make security work manageable.
It’s time to ignore the number of issues and focus on interactively shaping, then enforcing, the way your organization uses established policies to make an impact — no overwhelming feeling required.
The Cloudy History of Cloud Native Policy
From Kubernetes’ first days, YAML configurations have been the building blocks of a functioning cluster and happily running applications. As the essential bridge between a developer’s application code and an Ops engineer’s work to keep the cluster humming, they’re not only challenging to get right, but also the cause of most deployment/service-level issues in Kubernetes. To add in a little extra spiciness, no one — not developers and not Ops engineers — wants to be solely responsible for them.
Policy entered the cloud native space as a way to automate the way YAML configurations are written and approved for production. If no one person or team wants the responsibility of manually checking every configuration according to an internal style guide, then policies can slowly shape how teams tackle common misconfigurations around security, resource usage and cloud native best practices. Not to mention any rules or idioms unique to their application.
The challenge with policies is that Kubernetes is agnostic to how, when and why you enforce them. You can write rules in multiple ways, enforce them at different points in the software development life cycle (SDLC) and use them for wildly different reasons.
There is no better example of this confusion than pod security policy (PSP), which entered the Kubernetes ecosystem in 2016 with v1.3. PSP was designed to control how a pod can operate and reject any noncompliant configurations. For example, it allowed a K8s administrator to prevent developers from running privileged pods everywhere, essentially decoupling low-level Linux security decisions from the development life cycle.
PSP never left beta, for a few good reasons. These policies were only applied when a person or process requested the creation of a pod, which meant there was no way to retrofit PSPs or enable them by default. The Kubernetes team admits PSP made it too easy to accidentally grant too-broad permissions, among other difficulties.
The PSP era of Kubernetes security was so fraught that it inspired a new rule for release cycle management: No Kubernetes project can stay in beta for more than two release cycles, either becoming stable or marked for deprecation and removal.
On the other hand, PSP moved the security-in-Kubernetes space in one positive direction: By separating the creation and instantiation of Kubernetes security policy, PSP opened up a new ecosystem for external admission controllers and policy enforcement tools, like Kyverno, Gatekeeper and, of course, Monokle.
We’ve used these tools to free our clusters from the PSP shackles and replace them with … the Pod Security Standard (PSS). We’ll come back to that big difference in a minute.
A Phase-Based Approach to Kubernetes Policy
With this established decoupling between policy creation and instantiation, you can now apply a consistent policy language across your clusters, environments and teams, regardless of which tools you choose. You can also switch the tools you use for creation and instantiation at will and get reliable results in your clusters.
Creation typically happens in an integrated development environment (IDE), which means you can stick with your current favorite to express rules using a rule-specific language like Rego, the language of Open Policy Agent (OPA), a declarative syntax like Kyverno’s, or a programming language like Go or TypeScript.
Instantiation and enforcement can happen in different parts of the software development life cycle. As we saw in our previous 101-level post on Kubernetes YAML policies, you can apply validation at one or more points in the configuration life cycle:
- Pre-commit directly in a developer’s command line interface (CLI) or IDE,
- Pre-deployment via your CI/CD pipeline,
- At deployment time via an admission controller like Kyverno or Gatekeeper, or
- In-cluster for checking whether the deployed state still meets your policy standards.
The later policy instantiation, validation and enforcement happen in your SDLC, the more likely a dangerous misconfiguration slips its way into the production environment, and the more work will be needed to identify and fix the original source of any misconfigurations found. You can instantiate and enforce policies at several stages, but earlier is always better — something Monokle excels at, with robust pre-commit and pre-deployment validation support.
With the scenario in place — those dreaded 90 issues — and an understanding of the Kubernetes policy landscape, you can start to whittle away at the misconfigurations before you.
Step 1: Implement the Pod Security Standard
Let’s start with the PSS mentioned earlier. Kubernetes now describes three encompassing policies that you can quickly implement and enforce across your cluster. The “Privileged” policy is entirely unrestricted and should be reserved only for system and infrastructure workloads managed by administrators.
Start by instantiating the “Baseline” policy, which allows a minimally specified pod, the starting point for most developers new to Kubernetes:
- name: my-container
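For reference, a complete manifest for a minimally specified pod might look like the sketch below. Only the container name comes from the fragment above; the pod name and image reference are placeholder assumptions.

```yaml
# Minimal pod sketch: no securityContext and no resource settings.
# "my-pod" and the image reference are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: registry.example.com/my-app:1.0.0
```

A pod like this sets none of the fields that Baseline forbids, such as privileged mode or host namespaces, so it passes without changes.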
The advantage of starting with the Baseline is that you prevent known privilege escalations without needing to modify all your existing Dockerfiles and Kubernetes configurations. There will be some exceptions, which I’ll talk about in a moment.
Creating and instantiating this policy level is relatively straightforward — for example, on the namespace level:
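As a sketch of what namespace-level enforcement can look like, the manifest below uses the labels read by Kubernetes’ built-in Pod Security Admission controller (stable since v1.25). The namespace name is a hypothetical placeholder, and the warn and audit labels are optional additions assumed here to preview the stricter level.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace  # hypothetical namespace name
  labels:
    # Reject pods that violate the Baseline level.
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    # Optional: surface Restricted violations without rejecting anything.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

With these labels, pods that violate Baseline are rejected at admission, while violations of Restricted only produce client warnings and audit log annotations.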
You will inevitably have some special services that require more access than Baseline allows, like a Promtail agent for collecting logs and observability. In these cases, where you need certain beneficial features, those namespaces will need to operate under the Privileged policy. You’ll need to keep up with security improvements from that vendor to limit your risk.
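One way to express such an exception, assuming you use the built-in Pod Security Admission labels, is to give the special service its own namespace and set that namespace to the Privileged level. The namespace name below is a hypothetical placeholder.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: logging  # hypothetical namespace reserved for the log-collection agent
  labels:
    # Explicitly opt this namespace out of Baseline restrictions.
    pod-security.kubernetes.io/enforce: privileged
```

Keeping the exception scoped to a dedicated namespace leaves a visible, reviewable record of where Privileged is allowed.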
By enforcing the Baseline level of the Pod Security Standard for most configurations and allowing Privileged for a select few, then fixing any misconfigurations that violate these policies, you’ve checked off your next policy milestone.
Step 2: Fix Labels and Annotations
Labels are meant to identify resources for grouping or filtering, while annotations are for important but nonidentifying context. If your head is still spinning from that, here’s a handy definition from Richard Li at Ambassador Labs: “Labels are for Kubernetes, while annotations are for humans.”
Labels should only be used for their intended purpose, and even then, be careful with where and how you apply them. In the past, attackers have used labels to probe deeply into the architecture of a Kubernetes cluster, including which nodes are running individual pods, without leaving behind logs of the queries they ran.
The same idea applies to your annotations: While they’re meant for humans, attackers often use them to obtain credentials that, in turn, give access to even more secrets. If you use annotations to describe the person who should be contacted in case of an issue, know that you’re creating additional soft targets for social engineering attacks.
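To make the distinction concrete, here is a minimal sketch of a pod whose labels follow the `app.kubernetes.io/*` keys recommended in the Kubernetes documentation and whose annotation points to a team alias rather than a named person. The application names, the annotation key and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout  # hypothetical application name
  labels:
    # Identifying metadata that selectors and tools can group or filter on.
    app.kubernetes.io/name: checkout
    app.kubernetes.io/part-of: webshop
    app.kubernetes.io/managed-by: helm
  annotations:
    # Nonidentifying context for humans; a team alias, not an individual,
    # and never a credential.
    example.com/owner-team: payments-oncall
spec:
  containers:
    - name: checkout
      image: registry.example.com/checkout:1.0.0
```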
Step 3: Migrate to the Restricted PSS
While Baseline is permissive but safe-ish, the “Restricted” Pod Security Standard employs current best practices for hardening a pod. As Red Hat’s Mo Khan once described it, the Restricted standard ensures “the worst you can do is destroy yourself,” not your cluster.
With the Restricted standard, developers must write applications that run as a non-root user, have enabled only the Linux capabilities necessary for the Pod to run, cannot escalate privileges at any time and so on.
I recommend starting with the Baseline and migrating to Restricted later, as separate milestones, because the latter almost always requires active changes to existing Dockerfiles and Kubernetes configurations. As soon as you instantiate and enforce the Restricted policy, your configurations will need to adhere to these policies or they’ll be rejected by your validator or admission controller.
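As an illustration of the kind of change Restricted asks for, the sketch below adds the security context fields that the Restricted level checks to a simple pod. The pod name and image are placeholder assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod  # hypothetical
spec:
  securityContext:
    runAsNonRoot: true        # refuse to start containers that run as root
    seccompProfile:
      type: RuntimeDefault    # use the container runtime's default seccomp profile
  containers:
    - name: my-container
      image: registry.example.com/my-app:1.0.0  # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]       # drop every Linux capability
```

Note that `runAsNonRoot` only validates the user at startup. If the image itself is built to run as root, the container fails to start, which is why this milestone often means touching Dockerfiles as well.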
Step 3a: Suppress, Not Ignore, the Inevitable False Positives
As you work through the Baseline and Restricted milestones, you’re approaching a more mature (and complicated) level of policy management. To ensure everyone stays on the same page regarding the current policy milestone, you should start to deal with the false positives or configurations you must explicitly allow despite the Restricted PSS.
When choosing between ignoring a rule or suppressing it, always favor suppression. That requires an auditable action, with logs or a configuration change, to codify an exception to the established policy framework. You can add suppressions in source, directly in your K8s configurations, or externally, where a developer asks an operations peer to reconfigure the validator or admission controller to allow a “misconfiguration” to pass through.
In Monokle, you add in-source suppressions directly in your configuration as an annotation, with what the Static Analysis Results Interchange Format (SARIF) specification calls a justification:
monokle.io/suppress.pss.host-path-volumes: Agent requires access to back up cluster volumes
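In context, that annotation sits under `metadata.annotations` of the resource being validated. The fragment below is a sketch of that placement only; the resource name is hypothetical, and the rest of the manifest is left out.

```yaml
# Fragment of a resource manifest, showing only the metadata block.
metadata:
  name: backup-agent  # hypothetical resource name
  annotations:
    # Key and justification text as given above.
    monokle.io/suppress.pss.host-path-volumes: Agent requires access to back up cluster volumes
```

Because the suppression lives in the manifest, it goes through the same review and version history as any other configuration change.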
Step 4: Layer in Common Hardening Guidelines
At this point, you’ve moved beyond established Kubernetes frameworks for security, which means you need to take a bit more initiative on building and working toward your own milestones.
The National Security Agency (NSA) and Cybersecurity and Infrastructure Security Agency (CISA) have a popular Kubernetes Hardening Guide, which details not only pod-level improvements, such as effectively using immutable container file systems, but also network separation, audit logging and threat detection.
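Two of those recommendations translate directly into configuration. The sketch below shows one possible reading of them, not text from the guide itself: a default-deny ingress NetworkPolicy for network separation and a container with an immutable root file system. All names and the image are hypothetical.

```yaml
# Deny all ingress traffic to every pod in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {}   # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
---
# Run a container with a read-only root file system.
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  namespace: my-namespace
spec:
  containers:
    - name: my-container
      image: registry.example.com/my-app:1.0.0
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp   # writable scratch space for the application
  volumes:
    - name: tmp
      emptyDir: {}
```

After applying a default-deny policy, you add narrower NetworkPolicy resources that allow only the traffic each workload actually needs.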
Step 5: Time to Plug and Play
After implementing some or all of the established hardening guidelines, every new policy is about choices, trust and trade-offs. Spend some time on Google or Stack Overflow and you’ll find plenty of recommendations for policies you can plug and play into your enforcement mechanism.
You can benefit from crowdsourced policies, many of which come from those with more specialized experience, but remember that while rules might be well-intentioned, you don’t understand the recommender’s priorities or operating context. They know how to implement certain “high-hanging fruit” policies because they have to, not because they’re widely valuable.
One ongoing debate is whether, and how strictly, to limit the resources a container can use. The same goes for resource requests. Not configuring limits can introduce security risks, but if you severely constrain your pods, they might not function properly.
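For readers unfamiliar with the settings being debated, this is a minimal sketch of a container with both requests and limits. The numbers are arbitrary illustrations, not recommendations, and the names are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: registry.example.com/my-app:1.0.0
      resources:
        requests:          # what the scheduler reserves for the container
          cpu: 100m
          memory: 128Mi
        limits:            # the ceiling the container may not exceed
          cpu: 500m
          memory: 256Mi
```

A container that exceeds its memory limit is killed, while one that hits its CPU limit is throttled, so limits set too low show up as restarts or slow responses rather than as policy violations.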
Step 6: Add Custom Rules for the Unforeseen Peculiarities
Now you’re at the far end of Kubernetes policy, well beyond the 20% of misconfigurations and vulnerabilities that create 80% of the negative impact on production. But even now, having implemented all the best practices and collective cloud native knowledge, you’re not immune to misconfigurations that unexpectedly spark an incident or outage — the wonderful unknown unknowns of security and stability.
A good rule of thumb is if a peculiar (mis)configuration causes issues in production twice, it’s time to codify it as a custom rule to be enforced during development or by the admission controller. It’s just too important to be latently documented internally with the hope that developers read it, pay attention to it and catch it in each other’s pull-request reviews.
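As one hedged example of codifying such a rule, suppose the recurring incident was caused by pods deployed with a mutable `latest` image tag. A Kyverno policy along the following lines could reject them at admission. The scenario and policy name are hypothetical, and field names such as `validationFailureAction` vary between Kyverno versions, so check the documentation for the version you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag  # hypothetical policy name
spec:
  validationFailureAction: Enforce  # reject violating resources instead of only reporting
  rules:
    - name: validate-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must not use the mutable 'latest' tag."
        pattern:
          spec:
            containers:
              # Every container image must not end in ":latest".
              - image: "!*:latest"
```

The same rule can usually be run earlier in the life cycle as well, so developers see the failure before the manifest ever reaches the cluster.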
Once codified into your existing policy, custom rules become guardrails you enforce as close to development as possible. If you can reach developers with validation before they even commit their work, which Monokle Cloud does seamlessly with custom plugins and a development server you run locally, you save your entire organization a lot of rework. Developers are no longer stuck twiddling their thumbs, waiting for a CI/CD pipeline to inevitably fail, when they could be building new features or fixing bugs.
If you implement all the frameworks and milestones covered above and make all the requisite changes to your Dockerfiles and Kubernetes configurations to meet these new policies, you’ll probably find your list of 90 major vulnerabilities has dropped to a far more manageable number.
You’re seeing the value of our step-by-step approach to shaping and enforcing Kubernetes policies. The more you can interact with the impact of new policies and rules, the way Monokle does uniquely at the pre-commit stage, the easier it’ll be to make incremental steps without overwhelming yourself or others.
You might even find yourself proudly claiming that your Kubernetes environment is entirely misconfiguration-free. That’s a win, no doubt, but it’s not a guarantee — there will always be new Kubernetes versions, new applications and new best practices to roll into what you’ve already done. It’s also not the best way to talk about your accomplishments with your leadership or executive team.
The advantage of leveraging the frameworks and hardening guidelines is that you have a better common ground to talk about your impact on certification, compliance and long-term security goals.
What sounds more compelling to a non-expert:
- You reduced your number of CVEs from 90 to X,
- Or that you fully complied with the NSA’s Kubernetes hardening guidelines?
The sooner we worry less about numbers and more about common milestones, enforced as early in the application life cycle as possible (ideally pre-commit!), the sooner we can find the sustainable sweet spot for each of our unique forays into cloud native policy.