The 2023 State of Kubernetes in Production
Kubernetes in production is still too complex. This isn’t new, but in 2023, the year that has seen platform engineering go mainstream, container management’s biggest play may be flexible, yet there are still many challenges around policies and governance for ops and around onboarding developers. Eight years in, using Kubernetes, especially in production, is still a massive undertaking.
Spectro Cloud just released the 2023 State of Production Kubernetes report, in cooperation with Dimensional Research. Spectro Cloud’s Ant Newman, who led the project, gave The New Stack an exclusive interview to discuss some of the most pervasive challenges, which a whopping 98% of the 333 IT operations and app development stakeholders interviewed said they face.
“If you’re trying to standardize and have the same stack in the same environment for all your applications — that’s not realistic for today’s IT environment. It’s never been realistic,” Newman said. “The idea instead has to be how do you manage diversity, rather than how do you limit it.”
Indeed, one of the biggest draws of Kubernetes is its extensibility and broad use cases across environments. But with great flexibility comes great complexity.
From enterprise guardrails to the operations tech talent gap to inconsistencies across different environments, Kubernetes users are dealing with two sides of the same coin — flexibility versus complexity. Both have a huge impact on internal developer experience. As we kick off KubeCon + CloudNativeCon North America, let’s reflect on the challenges of managing Kubernetes in production. As Newman said, it’s all about finding ways to manage without limiting.
Kubernetes Complexity Has Consequences.
“Kubernetes is the most frustrating, painful, and beautiful thing I’ve worked with in my technology career.”
This anonymous yet candid survey response from an IT operations manager sums it up well. Users have a love-hate relationship with Kubernetes, or K8s if you’re cool in certain parts of the internet. DevOps teams want this flexible way to manage containers (in the cloud, on bare metal or virtual machines, or at the edge) but are struggling to manage this level of complexity in a secure, scalable way.
The majority of enterprises interviewed have more than 10 Kubernetes clusters across more than one hosting environment; 14% have more than a hundred. “Complexity is multiplied by the number of Kubernetes distributions, each with slightly different use models and capabilities that must be understood,” the paper explained.
The paper found that 83% of those interviewed had anywhere from two to more than 10 distributions, across service distributions (like Amazon Web Services EKS-D), self-hosted distributions (like Red Hat OpenShift), edge-specific distributions (like K3s and MicroK8s) and more.
This adds up to about 20 different, documented pathways in a single organization. On top of this, many of these differences stem from regulatory or industry requirements.
Giving developers access to clusters is another real operations pain point.
“What we found in the interviews was that you start with developers’ self-service in order to give them control and speed,” Newman told The New Stack. “And then you realize that’s causing problems, or the devs now have to own maintenance of the documentation, maintenance of the configuration, and suddenly they’re not coding features anymore. They’re managing plumbing.”
So, after Kubernetes led much of the drive for developer autonomy over the last few years, it is now driving a swing back toward guardrails, golden paths and standardization, as ops teams can’t keep up with the unique requests. Companies of all sizes are trying to figure out where they want the pendulum to land, and each is settling on a different balance, Newman said. The ways developers get access to a cluster can include:
- Starting up a cluster each time.
- Deploying on infrastructure owned by the ops team.
- Running a cluster in the cloud or in a home lab.
- Using a self-service internal developer portal.
- Raising a ticket with operations.
Each org has its own way — or several ways — of working with Kubernetes in production, but, he continued, every organization they interviewed is constantly considering how to balance the speed of developer self-service with the necessary ops control.
Platform Engineering Isn’t a Panacea.
“There’s still an issue here that needs to be solved. And for all the important progress on platform engineering over the last couple of years, and that being the golden path forward, I don’t think the interviewees felt that had been solved yet. That’s why they’re trying all these tools promising to fix the complexity of Kubernetes for developers,” Newman said. “But there was a good chunk of people who said we’re trying these things and we stopped trying it because it didn’t work out — almost like sticking another tool on top doesn’t actually solve the fundamental problem of the balance between control and self-service.”
Indeed, 14% of respondents had piloted at least one developer experience tool and later dropped it. Exacerbating this, the second-biggest challenge found by the report is a lack of access to operations talent that can deal with the constantly evolving Kubernetes landscape. Operations burnout is real as teams try to keep up with the demand to regularly upgrade and patch this plethora of solutions.
“The interviewees said that they were in this vicious cycle of spending time on troubleshooting and patching, which means they don’t have time to invest in building golden paths, investing [in] automation and looking at how they should simplify — because they’re just running to stand still,” he continued.
Plus, 14 years into DevOps, developers still aren’t used to being responsible for how their code will run later and feel it can be a distraction from their traditional development mindset, the report found. We know that the shift left famously shifted their attention away from their flow state. This drives a “significant need” for tools to help developers leverage Kubernetes, with 62% of interviewees either having adopted or being in the process of adopting a tool for application developers.
Yet, while 92% agreed that developers should be spending their time coding features, not managing infrastructure, 82% said it’s difficult for ops teams to give every dev team a cluster tailored to their preference. It’s clear that Kubernetes demands a golden path, or perhaps several routes to production; this free-for-all cannot go on.
Especially in the tighter times of 2023, when everyone is trying to do more with less, the team at Spectro Cloud has seen a bigger push for standardization to help with both cost efficiency and security. Just make sure you aren’t creating gates that stifle developer creativity and experimentation, Newman said.
This is why the challenge most frequently experienced by those interviewed was how to establish enterprise guardrails. This is the first year this challenge, cited by 48% of respondents, has topped the list. The report said this points to maturing Kubernetes adoption, as complexity grows when you start managing containers in mission-critical, business-impacting use cases.
Interoperability Remains a Challenge.
As your Kubernetes strategy scales, interoperability becomes a greater challenge, too. In fact, three-quarters of respondents said they suffer interoperability issues, such as among service mesh, persistent storage and secrets, at least occasionally.
Those with 20 or more clusters in production were three times as likely to suffer issues because of interoperability.
There are still some parts where the enterprise interviewees were confident a platform-based approach would work. Specifically, 86% want to unify containerized and VM workloads into a single infrastructure platform.
It’s notable that this complexity scales exponentially, with companies that have more than 20 production clusters reporting significantly higher levels of complexity indicators. These companies were far more likely to report having more than five distributions, as well as stacks with more than 15 distinct software elements, including:
- Load balancers
- Secrets management
- Security tooling
- Service mesh
- Monitoring and observability
“We’ve always argued that a production Kubernetes cluster is way more than just the choice of distribution, CNI and CSI and the OS underneath,” Newman said. “Eighty percent of the value, and 80% of the complexity, comes from the choices you make about what goes into the cluster to support your applications.”
These elements all make Kubernetes interoperability significantly harder, too. The survey also found a correlation: The more clusters you have, the more of these different elements make up your stack, which in turn makes it harder to standardize across the org.
“The more elements you have, the more opportunity there is for interoperability issues. The more tools you need to configure and secure. The more things you need to patch and update,” he continued. “This is why full-stack, declarative management is so important.”
Automation Reduces Complexity.
So how do you solve a problem as scaled as Kubernetes complexity? How can ops teams reconcile the differences among development, staging and production environments? How can they spend less time troubleshooting and more time maintaining availability and application performance?
More than half of respondents felt automation would make a significant improvement to operational efficiency.
However, the paper warned: “Companies that develop automation scripts but do not treat them as an essential part of their infrastructure can create a nightmare when staff changes and knowledge of maintaining the scripts is lost.” That could be why the second most common solution was simplifying the software stack, a logical response to the second most common challenge: not enough skilled operators.
“If you’re deploying, you’re serving lots of different teams, and lots of different applications, and lots of different environments. You can’t necessarily simplify the stack all that much. There’s a reason why each of those teams picked each of those different tools or environments,” Newman warned. “That’s almost like, how do we lose weight? Let’s cut off our own arm. It’s not the right answer.”
But if an enterprise does a really good job of automation — and in documenting the why and the how for future operators — he said that you can maintain that level of diversity in the software stack, while still scaling operations’ coverage.
At the Edge, Kubernetes Is Popular.
This is the second year the State of Production Kubernetes report has zoomed in on Kubernetes for edge computing, which has for a while now been established as the way forward. The edge promises business process improvements like cost savings, as well as connecting in new, interesting and even faraway ways. Edge computing adoption is also being driven by compliance and data security requirements and by new workloads that only function when deployed at the edge. Plus there’s a lot of potential for AI on the edge.
“Increasingly, the edge is the best place to do it,” Newman said, but it’s still about figuring out how to get there, as many enterprises are still experimenting with edge use cases.
It’s telling that, while 93% of respondents with Kubernetes in production are working on edge computing initiatives, about a third of respondents were still unsure of their edge application and how to apply it. Indeed, only 7% are fully deployed with Kubernetes in production on the edge, with another 13% partially deployed. Another 29% are piloting their edge initiatives. The year-over-year trend is growing interest in edge computing, which will likely continue.
Kubernetes has become the de facto way to deploy containers on the edge, but another challenge that permeates these developing edge computing strategies is, again, Kubernetes complexity, this time at the farther edge.
Newman spoke of a customer that supplies hardware, with applications running on the edge, to thousands of medical clinics. With persistent downtime on the edge, they were struggling to find the technicians to head out to the sites in a timely fashion.
“These aren’t the sort of challenges when you’re doing edge in the lab; rebooting it is a case of walking across the lab and pressing the button,” he explained. The challenges in these early-stage adoptions are about figuring out how to make the edge economically viable in the next stage of edge computing adoption, out in the world. Plus, of course, he continued, security is paramount.
Newman excitedly concluded our conversation with: “This year, around edge, we’re seeing a huge amount of interest, and I think next year’s report we might even do a separate report just on that edge because this is one of the fast-growing, fast-evolving areas for us.”