Rafay sponsored this article for The New Stack.
Haseeb Budhani has seen it happen again and again.
And then things start to get … interesting.
Clusters multiply. Changes proliferate. Access demands pile up. Cloud costs spike.
“Whether you’re talking to a high-tech company, or a financial services company, a healthcare company, or a retailer running apps at the edge, the problems are all the same,” Budhani, CEO and co-founder of Rafay, a Kubernetes operations platform, told The New Stack.
“How do I manage access to my clusters? What’s the policy model that I’m going to use across all my environments? What add-ons must I always have in the standard blueprint for my production clusters? What is my strategy to deploy applications that belong to multiple business units? I’ve got to upgrade all my clusters soon since I’m already three versions behind across the board — how do I do that?”
These problems can all be grouped under the heading of “Day 2.” And Day 2 can mean a never-ending headache for site reliability engineers (SREs) and IT operations engineers.
“We’ve got to be honest about the pain here,” Budhani said. “We need a moment of catharsis in this industry.”
Why Is Day 2 So Painful?
The pain has a number of root causes, Budhani said. First, there’s the matter of the skills gap — eight years into the Kubernetes era and there still aren’t enough engineers who know enough about the K8s ecosystem.
A lack of in-house skills is the top challenge that companies encounter when adopting containers and Kubernetes, according to a survey released last June by Canonical.
Then there’s the hodgepodge of tools in the cloud native ecosystem that your organization uses to operationalize Kubernetes, each of which also regularly requires upgrades and attention.
“All these tools, they follow their own lifecycle. Every so often, each of these tools will need to be updated across all the clusters,” Budhani said. “So, you have to manage the lifecycle of your Kubernetes cluster, the lifecycle of each of these tools, the lifecycle of your applications, centralized policy and access management as new internal teams deploy more apps, a disaster recovery strategy for each app, charge-back strategies, and more.”
“This is Day 2. Day 2 is about what needs to happen to keep the lights on while the underlying technologies each require custom strategies for their governance, operational security and visibility in a fully automated fashion.”
An overarching issue he sees is that not enough enterprises are using what he calls “automation with governance” — the developer velocity and freedom from unnecessary toil that cloud native architecture promises, coupled with the checks and balances that organizations need to control access to critical data, applications and infrastructure, and control cloud costs.
“We’re not aligning to the North Pole that we all agreed was the right thing to do: automation,” Budhani said. “By definition, if you’re building it again and again, you aren’t following the first rule of DevOps: Automate everything.”
What Does Day 2 for Kubernetes Ops Look Like?
On an ongoing basis, managing your Kubernetes operations requires keeping track of a number of things. For organizations that need to deploy into multi-cloud or hybrid environments, this complexity — and the challenges of keeping tabs on all the moving parts — compounds. In fact, the fear of dealing with that complexity can keep organizations from moving toward multi-cloud solutions in the first place and could lead to vendor lock-in that prevents an organization from realizing its business goals.
But the key areas that need attention remain the same, no matter where you’re deploying your applications. Here, according to Budhani and other experts, are five pillars of Kubernetes Ops:
Cluster Standardization and Lifecycle Management
“You know what your cluster looks like today, when you built it,” Budhani said. “But how do you know what it looks like a month from now?” Even if you don’t touch the cluster again, he noted, “an installed add-on with high-enough privileges could end up changing foundational configuration without you knowing about it.”
You will need to keep tabs on your cluster’s entire lifecycle, including how it’s affected by the other tools and users that interact with it. Setting standards for creating and updating clusters across your organization, while ensuring that a sanctioned set of add-ons is always running across your cluster fleet, can help simplify the Day 2 task of identifying anomalies when they occur.
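One way to picture that standard is as a simple audit of each cluster against an agreed blueprint. The sketch below is illustrative only and is not how Rafay or any particular platform implements it: the `BLUEPRINT` contents, the `Cluster` class, and the add-on names and versions are all hypothetical, and a real audit would read installed add-ons from the clusters themselves rather than from hard-coded data.

```python
from dataclasses import dataclass, field

# Hypothetical blueprint: sanctioned add-on name -> required version.
BLUEPRINT = {
    "ingress-controller": "1.2.0",
    "log-shipper": "0.9.1",
}


@dataclass
class Cluster:
    name: str
    # Installed add-on name -> installed version (assumed to be collected elsewhere).
    addons: dict = field(default_factory=dict)


def audit_cluster(cluster, blueprint):
    """Return human-readable findings for one cluster; an empty list means it conforms."""
    findings = []
    for addon, required in sorted(blueprint.items()):
        installed = cluster.addons.get(addon)
        if installed is None:
            findings.append(f"{cluster.name}: missing {addon}")
        elif installed != required:
            findings.append(
                f"{cluster.name}: {addon} is {installed}, blueprint requires {required}"
            )
    # Anything installed that the blueprint does not list is flagged as unsanctioned.
    for addon in sorted(set(cluster.addons) - set(blueprint)):
        findings.append(f"{cluster.name}: unsanctioned add-on {addon}")
    return findings


def audit_fleet(clusters, blueprint):
    """Audit every cluster in the fleet and collect all findings."""
    findings = []
    for cluster in clusters:
        findings.extend(audit_cluster(cluster, blueprint))
    return findings


fleet = [
    Cluster("prod-east", {"ingress-controller": "1.2.0", "log-shipper": "0.9.1"}),
    Cluster("prod-west", {"ingress-controller": "1.1.0", "debug-proxy": "0.1.0"}),
]

for finding in audit_fleet(fleet, BLUEPRINT):
    print(finding)
```

Run on a schedule, a check like this turns “what does my cluster look like a month from now?” into a short list of exceptions instead of a manual inspection.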
Secure Access and Isolation
A distributed network, run in full or in part on the cloud with the help of Kubernetes, demands an entirely new approach to operator/developer access and security. A network that lives everywhere is vulnerable to attack anywhere. (Sleep well tonight, dear reader!)
The zero trust approach to security has been gaining ground among organizations that have moved or are moving to the cloud. Zero trust rejects the old “castle and moat” model of security, instead using granular, automated authentication and authorization privileges to protect vital infrastructure and data, wherever they may live.
But many, if not most, organizations are still grappling with the basics when it comes to access controls. Eighty percent of participants in a survey released in January by strongDM said their organization would be working on access management this year; only 30% said a zero trust project was in their plans. (And one in three respondents in that same survey called Kubernetes the most challenging technology they work with.)
Securing access to the Kubernetes API server can help prevent unauthorized probing. And when something goes wrong in a particular Kubernetes cluster (the injection of malware, for example), that cluster or microservice needs to be isolated to keep the problem from spreading.
Observability and Visibility
The administrators of your Kubernetes clusters need enough visibility into all environments, along with the requisite level of alerting and monitoring, to triage issues as they arise. Solutions such as Rafay’s Kubernetes Operations Platform provide these functions out of the box. Access to long-term metrics and alert data helps SRE and IT Ops teams understand trends across their cluster fleet, which in turn supports planning and forecasting.
Governance and Compliance
Kubernetes is, of course, open source — wide open, like the Wild West. And the companies that use it often struggle to add critical governance and compliance capabilities, such as logging, drift detection and auditability.
Centralized, enforceable cluster configuration models help with enterprise-wide cluster standardization. Having a way to confirm that all mandated security and operational add-ons are deployed supports compliance with enterprise policies. Further, having a way to detect when a cluster deviates from those policies, and to remediate the deviation when it occurs, is also a critical requirement.
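At its simplest, drift detection means comparing the configuration a team declared with the configuration actually observed on the cluster. The following is a minimal sketch of that comparison, assuming both sides are available as nested dictionaries with string keys; the `find_drift` function and the sample settings are hypothetical and do not represent any vendor’s drift detection logic.

```python
def find_drift(desired, observed, path=""):
    """Compare two nested dicts and return (path, desired, observed) tuples that differ.

    Simplification: a key that is absent is reported as None, so an explicit
    None value is indistinguishable from a missing key.
    """
    drift = []
    for key in sorted(set(desired) | set(observed)):
        here = f"{path}.{key}" if path else key
        want = desired.get(key)
        have = observed.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Descend into nested sections so the report names the exact setting.
            drift.extend(find_drift(want, have, here))
        elif want != have:
            drift.append((here, want, have))
    return drift


# Hypothetical declared and observed state for one cluster.
desired = {"ingress": {"replicas": 2, "tls": True}, "audit_logging": True}
observed = {
    "ingress": {"replicas": 2, "tls": False},
    "audit_logging": True,
    "debug": True,
}

for setting, want, have in find_drift(desired, observed):
    print(f"drift at {setting}: declared {want!r}, found {have!r}")
```

Remediation can then be as blunt as reapplying the declared state, or as cautious as opening a ticket for each reported path; which one is appropriate depends on the organization’s change policy.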
Rafay’s Kubernetes Operations Platform provides capabilities such as cluster blueprinting, add-on version control, policy enforcement and violation reporting, along with drift detection logic that can block changes to cluster-wide resources such as ingress controllers, runtime security tooling, etc., and an end-to-end audit trail of cluster activity.
Third-Party Integrations and Maintenance
Making all the services and tools that power your modern (Kubernetes-based) infrastructure operate seamlessly and play well together can be tricky. Major cloud providers usually have a suite of tools that help manage Kubernetes, but these don’t always translate well if you step outside that cloud provider’s universe, perhaps to deploy to multiple clouds, in on-premises environments, or, in some cases, at the edge.
It’s like a jigsaw puzzle of components that need to constantly fit together even though each puzzle piece has its own lifecycle to manage. Open source tools and components can bring their own problems: vulnerabilities like those discovered late in 2021 with Log4j, for instance. Or simply the toil involved in updating your version of Kubernetes itself every quarter.
Outdated tools can result in unplanned downtime, which can directly impact end customers — and, ultimately, the business.
The Ongoing Cost of Building a K8s Platform
The burden of Day 2 Kubernetes isn’t just about cloud spend. Operating and maintaining Kubernetes can also pull team members away from working on products and applications that directly generate top-line revenue for the company.
Some of the things that can add to Day 2 headaches stem from a lack of standardization. They can include complex triage and support costs when clusters with unique configurations fail, or security risks resulting from custom access and networking between controllers and clusters. A lack of kubectl access control can also expose the business to compliance and governance risks.
“Managers are sometimes reluctant to bring up Day 2 issues in the early days of the Kubernetes journey,” Budhani said, because “they don’t want to upset the developer mindset.”
That reticence is misguided, he added: Developers already know their workload is out of balance because of Kubernetes Day 2 problems.
“When I talk to actual developers, they say, ‘I don’t know why I’m writing Helm charts instead of my app,’” Budhani said. “‘Yeah, I wanted to experiment with Kubernetes. I kind of liked this new technology. And I loved learning about it when I was initially exposed to it. But, my God, I’ve got a job to do.’”
C-level executives, DevOps managers, and developers all want the same thing, he said: an efficient way to ship more and better code and generate more revenue for the business. But, he noted, “they’re not talking the same language. And in the process, these large enterprises are way behind schedule on their deliverables.”
Solving the Kubernetes Day 2 Problem
“To prepare your teams and your organization for the long-term commitment of overseeing a cloud architecture built on top of Kubernetes, it’s important to understand exactly what your organization is getting into,” said Budhani.
Before you even start a cloud native project that includes operating Kubernetes, think about what your organization is trying to accomplish, he urged.
Chief information officers and other senior executives, Budhani said, “used to challenge their teams to do more, to experiment more.” Now, he said, the question should be, “Why are you experimenting with Kubernetes add-ons? Prove to me that you need to build something beyond the things that can be purchased off the shelf before you go experiment.”
The true cost of standardizing on K8s includes pricing models, implementation and maintenance. Among the questions to ask:
- How will you determine which tools work best for your use case?
- How will you keep up with open-source changes?
- Will there be continued investment in tool integrations?
- Who will fix interoperability issues and operational problems when they arise?
- Can you hire enough people with cloud native skills, or train the team members you already have, fast enough?
Above all, consider talking to other organizations that have also crossed the chasm and implemented Kubernetes.
“First, understand how others are doing things, because that gives you a sense of how hard or simple this problem is,” Budhani said. “Learning by failing is great, but it comes at the cost of time. Why take this path when you can learn from your peers? The beautiful thing about the Kubernetes community is that people are quite open to sharing their experiences and opinions.”
He concluded, “In 2022, you’re not the only company working on Kubernetes. Lots of other companies have been through the journey or are on this journey. There are resources out there to do it right. Look for them.”
Rafay and strongDM are sponsors of The New Stack.