5 Best Practices for Securing Kubernetes Runtime Monitoring

To secure their Kubernetes and container environments, organizations often prioritize pre-deployment, or “build time,” configurations to ensure that workloads run with the fewest privileges possible, implement sound security policies, and use least-privileged Kubernetes role-based access control (RBAC). The goal is to minimize the Kubernetes attack surface, limit a threat’s ability to move laterally, and reduce the impact of any breach.
But companies cannot simply hope that their build-time configurations are 100% secure. They also need “runtime” visibility to understand how their running Kubernetes clusters and container configurations drift from their initial state, and to identify what manual changes are being made.
More importantly, organizations need runtime detection to promptly identify cyberthreats and the exploitation of vulnerabilities and misconfigurations at every layer of their Kubernetes and container environment. Runtime logging, monitoring, and detection are also essential for compliance with regulations and frameworks such as SOC 2, NIST 800-53, the CIS Benchmark for Kubernetes, PCI DSS, and ISO 27001.
Per the Red Hat 2021 State of Kubernetes Security Report: “Nearly everyone — 94% of respondents — admitted to experiencing a security incident in the last 12 months… a sizable portion also identified a major vulnerability, experienced a runtime incident, or failed an audit. These findings become more critical when respondents have deployed their Kubernetes workloads in production environments.”
Here are five security best practices for Kubernetes runtime monitoring and detection.
1. Monitor Network Connections
It is critical to map and monitor all network connections, both internal (east-west) and external (north-south). This provides an understanding of how Kubernetes workloads and namespaces interact with each other and what external resources (e.g., cloud services, external APIs) are being accessed. It also reveals the entire footprint of a Kubernetes cluster and supports both threat detection and compliance requirements that mandate mapping all internal and external network traffic.
Understanding what normal network traffic looks like enables a runtime monitoring solution to detect abnormal behavior, such as operational issues that lead to an increase in errors in east-west traffic, or excessive calls to an external API that end up blocked by the provider. Abnormal network connections can also be caused by a threat in the environment, such as a rogue container or compromised application performing a network or port scan to discover the environment, or attacking other internal Kubernetes or cloud services (lateral movement).
It is also important to monitor access to known malicious IPs and domains as indicators of compromise (IoCs). Many attacks against Kubernetes have resulted in cryptominer software being installed in containers. Monitoring network connections would reveal connections to new IPs and domains, often already known to be malicious, used to download the cryptominer binary, connect to a mining pool, and send information back to the attacker.
Attackers can also probe the environment for misconfigurations to exploit, such as non-production clusters with access to sensitive production services (databases, vaults) or plain-text HTTP traffic sent to public endpoints.
Kubernetes does not offer any out-of-the-box tools to show network connections. An effective network security monitoring tool should show the connections coming from the nodes and from each workload, identifying them by their full identity (i.e., namespace, workload name, pod name) rather than by internal IP address.
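As a minimal sketch of this idea, the Python snippet below checks observed flows against a denylist of known-bad IPs and reports hits by workload identity instead of internal IP. The flow-record file (flows.jsonl), its field names, and the denylist file are assumptions for illustration; the export format of your CNI or network monitoring tool will differ.

```python
# Minimal sketch: flag flows to known-malicious IPs and map source IPs back to
# workload identity (namespace/pod) instead of raw addresses.
# Assumes flow records are available as JSON lines with "src_ip", "dst_ip",
# and "dst_port" fields, plus a maintained denylist file (both hypothetical).
import json
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Map pod IP -> "namespace/pod-name" so alerts carry workload identity.
pod_identity = {
    p.status.pod_ip: f"{p.metadata.namespace}/{p.metadata.name}"
    for p in v1.list_pod_for_all_namespaces().items
    if p.status.pod_ip
}

with open("malicious_ips.txt") as f:   # hypothetical IoC denylist
    malicious_ips = {line.strip() for line in f if line.strip()}

with open("flows.jsonl") as f:         # hypothetical flow export
    for line in f:
        flow = json.loads(line)
        if flow["dst_ip"] in malicious_ips:
            src = pod_identity.get(flow["src_ip"], flow["src_ip"])
            print(f"ALERT: {src} connected to known-bad IP "
                  f"{flow['dst_ip']}:{flow['dst_port']}")
```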
2. Monitor Ingress Endpoints
One of the top concerns for many DevOps teams is accidentally exposing an internal service to the internet. It’s easy to add an external load balancer or ingress in Kubernetes, and unintentionally expose a service that lacks the proper authentication and authorization required for public endpoints. This happened to a large vehicle manufacturer recently when their Kubernetes dashboard was accidentally exposed to the internet, letting attackers create and launch new pods through the web interface.
Any addition of an external load balancer, ingress controller, NodePort service, etc. should be logged and validated. First connections from public IPs to a workload should also be flagged to ensure that accidental exposure does not occur. The network map can then be used to determine whether any public-facing services have direct access to sensitive resources such as private AWS S3 buckets, internal databases, or vaults.
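One simple way to make this review possible is to periodically inventory every externally reachable entry point. Below is a hedged sketch using the official Kubernetes Python client; comparing successive runs, or feeding the output into an alerting pipeline, is an assumption left to your own tooling rather than anything Kubernetes provides.

```python
# Minimal sketch: inventory externally reachable entry points so that any new
# LoadBalancer, NodePort, or Ingress can be reviewed and validated.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
net = client.NetworkingV1Api()

# Services exposed outside the cluster network
for svc in v1.list_service_for_all_namespaces().items:
    if svc.spec.type in ("LoadBalancer", "NodePort"):
        print(f"exposed service: {svc.metadata.namespace}/{svc.metadata.name} "
              f"type={svc.spec.type}")

# Ingress objects and the hostnames they serve
for ing in net.list_ingress_for_all_namespaces().items:
    hosts = [r.host for r in (ing.spec.rules or []) if r.host]
    print(f"ingress: {ing.metadata.namespace}/{ing.metadata.name} hosts={hosts}")
```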
3. Monitor Kubernetes Audit Logs
One Kubernetes best practice is to keep containers immutable and use Infrastructure as Code (IaC) to define your environment prior to deployment. This means manual activities, such as logging into a container (kubectl exec), should be minimized. Forbidden activities should include actions like manually deleting resources or changing Kubernetes RBAC objects (roles, cluster roles, role bindings, etc.).
It’s difficult to detect manual changes by only looking at a Kubernetes cluster’s configuration, and impossible to tie them back to a user. The solution is to enable the Kubernetes audit logs, also called Kubernetes control plane logs, which record all Kubernetes API calls, including user activities and automated workflows. But these logs are very noisy: millions of entries can be generated daily, mostly for normal, internal activity.
You need a solution that can extract the interesting events out of all this noise, and point out unusual activities such as:
- Container access methods, such as kubectl exec, kubectl attach, and kubectl logs
- Network exposure of a container that bypasses any Kubernetes network policy, such as kubectl port-forward
- Any change or addition to roles, role bindings, cluster roles, or cluster role bindings
- Manual changes to a container image, including installation of packages and libraries
Changes to RBAC are difficult to track outside of the audit logs. A typical Kubernetes setup comes with over 20 default roles and 75 cluster roles, so it is nearly impossible to spot changes to these existing roles, or any unintentional or malicious binding, by inspection alone. It’s much easier to keep an audit trail of these changes in order to validate them later.
The Kubernetes audit logs also show API calls that fail due to lack of authentication, missing permissions, or other configuration problems, which helps detect both operational and security issues.
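To make this concrete, here is a hedged Python sketch that scans an audit log written as JSON lines (the default format for a file-based log backend) and surfaces exactly these kinds of events: container access through pod subresources, writes to RBAC objects, and denied API calls. The log path and the print-based alerting are assumptions; in practice these events would feed a SIEM or detection pipeline.

```python
# Minimal sketch: pull the "interesting" events out of a Kubernetes audit log.
# Assumes the audit log is written as JSON lines to a path such as
# /var/log/kubernetes/audit.log (path and alert output are illustrative).
import json

EXEC_SUBRESOURCES = {"exec", "attach", "portforward", "log"}
RBAC_RESOURCES = {"roles", "rolebindings", "clusterroles", "clusterrolebindings"}
WRITE_VERBS = {"create", "update", "patch", "delete"}

with open("/var/log/kubernetes/audit.log") as f:
    for line in f:
        event = json.loads(line)
        if event.get("stage") != "ResponseComplete":
            continue  # skip the duplicate RequestReceived entries
        ref = event.get("objectRef") or {}
        user = event.get("user", {}).get("username", "unknown")
        verb = event.get("verb", "")

        # kubectl exec / attach / port-forward / logs against a pod
        if ref.get("resource") == "pods" and ref.get("subresource") in EXEC_SUBRESOURCES:
            print(f"container access: {user} {ref.get('subresource')} "
                  f"{ref.get('namespace')}/{ref.get('name')}")

        # Any write to RBAC objects
        if ref.get("resource") in RBAC_RESOURCES and verb in WRITE_VERBS:
            print(f"RBAC change: {user} {verb} {ref.get('resource')}/{ref.get('name')}")

        # API calls denied for authentication or authorization reasons
        code = (event.get("responseStatus") or {}).get("code", 0)
        if code in (401, 403):
            print(f"denied call: {user} {verb} {ref.get('resource')} -> {code}")
```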
4. Monitor Deployment of New Kubernetes Components and Workloads
Many of the known Kubernetes breaches resulted in rogue containers, typically cryptominers, being deployed in the breached environment. These attacks went undiscovered for months, showing how difficult it is to spot one malicious pod among thousands. An attacker can simply use a vaguely familiar name to go undetected.
Most configuration-based security solutions, such as Pod Security Policy or Open Policy Agent (OPA) Gatekeeper, detect unsafe attributes used by workloads: privileged mode, running as root, unsafe capabilities, etc. A rogue container that uses none of these privileged settings will not be detected.
Workload runtime monitoring needs to go deeper. It must be able to detect new registries being used, and even new repositories (especially for public registries such as Docker Hub and Quay). It should also detect workloads created by new users, outside of the continuous deployment (CD) workflow in place at most companies. This would also catch out-of-band deployments of workloads, containers, or clusters that may not have been validated by the regular continuous integration (CI) workflow, potentially bypassing security checks implemented earlier in the application lifecycle.
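The registry and repository check can be illustrated with a short Python sketch against the live cluster. The allowlist prefixes are assumptions; a fuller implementation would also cover Pods, DaemonSets, StatefulSets, and Jobs, and would correlate with the audit logs from the previous section to identify which user created each workload.

```python
# Minimal sketch: flag workloads whose images come from registries or
# repositories outside an approved allowlist. The allowlist entries are
# hypothetical; adapt them to the registries your CI/CD pipeline actually uses.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

ALLOWED_PREFIXES = (
    "registry.example.com/",   # hypothetical private registry
    "quay.io/approved-org/",   # hypothetical approved public repository
)

for deploy in apps.list_deployment_for_all_namespaces().items:
    for container in deploy.spec.template.spec.containers:
        if not container.image.startswith(ALLOWED_PREFIXES):
            print(f"unapproved image: {deploy.metadata.namespace}/"
                  f"{deploy.metadata.name} uses {container.image}")
```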
5. Use Automated, Machine Learning-based Threat Detection
Let’s assume the prior best practices have been implemented and you are collecting all the events needed to monitor the runtime environment across all layers: Kubernetes, containers, and workloads. The final best practice is to leverage automated, machine learning-based threat detection to find unknown or advanced threats hiding in the sea of events generated at every layer.
This technology first monitors activity in your Kubernetes and container environment to establish a baseline of what is “normal.” This includes monitoring activity around applications, APIs, files, processes, users, networks, and more, to learn what normal events look like: consistent connections to certain IPs over time, health checks, pod restarts, workload rescheduling, and so on. Second, anomaly detection identifies deviations from that baseline that are abnormal and may represent unknown threats. Examples of this anomalous activity were covered in the prior sections. Without machine learning-based detection, manual detection rules will miss unknown threats and generate too many false-positive alerts that overwhelm security teams.
Another benefit of anomaly detection is that it can detect threat activity resulting from the exploitation of a zero-day vulnerability, such as the Log4j vulnerability. In the case of a successful remote code execution (RCE) exploit through Log4j, monitoring network connections would show connections to new IPs or domains, possibly some already known to be malicious, used to call back home, exfiltrate data, or fetch malicious content. Another example of a zero-day vulnerability is the “cr8escape” vulnerability in the CRI-O container runtime, which allows a threat to escape a container, move throughout the cluster, and perform malicious activity. Anomaly detection would alert on this sort of exploit behavior.
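To make the baseline-then-detect idea concrete, here is a deliberately simplified Python sketch. A real machine learning-based engine models many more signals (processes, file access, API calls, users) and scores deviations statistically rather than with a simple set lookup; the file names and record format are assumptions carried over from the earlier flow example.

```python
# Minimal sketch of baseline-then-detect logic for network destinations.
# Only baselines which destination IPs each workload talks to; real anomaly
# detection covers far more signals and uses statistical scoring.
import json
from collections import defaultdict

def load_flows(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# 1. Baseline: record the destinations each workload normally contacts.
baseline = defaultdict(set)
for flow in load_flows("flows_baseline.jsonl"):   # hypothetical training window
    baseline[flow["workload"]].add(flow["dst_ip"])

# 2. Detect: anything outside the baseline is a candidate anomaly.
for flow in load_flows("flows_live.jsonl"):       # hypothetical live traffic
    if flow["dst_ip"] not in baseline[flow["workload"]]:
        print(f"anomaly: {flow['workload']} contacted new destination {flow['dst_ip']}")
```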
Full Protection of Your Environment
While proper configuration of the Kubernetes and container environment before deployment is important, comprehensive monitoring and protection of the environment at runtime is just as important. For this, there are multiple best practices you should follow to ensure you have full runtime protection across your environment.