
The Runaway Problem of Kubernetes Operators and Dependency Lifecycles

18 Aug 2020 8:04am

The idea of the Kubernetes operator pattern is to package up the deep knowledge about running a specific application on Kubernetes in a reusable way, by turning the implementation and operational logic into code. A Helm chart will deploy the right artifacts, but it won’t encode how to secure and back up the application or manage application users. Operators made up of custom controllers and CRDs encode operational actions and take away the overhead of making sure you spin up the right resources and dependencies for that application or correctly resize an etcd cluster when it fails in the middle of the night.
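As a minimal illustration of the CRD half of that pattern, this is roughly what an operator registers so that users can declare an application cluster as a custom resource (the group, kind and fields here are hypothetical, modeled on a typical etcd-style operator):

```yaml
# Hypothetical CRD an etcd-style operator might register; the controller
# watches EtcdCluster objects and reconciles real resources toward them.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: etcdclusters.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: EtcdCluster
    plural: etcdclusters
    singular: etcdcluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:      # desired member count; the controller scales toward it
                  type: integer
                version:   # application version the controller should run
                  type: string
```

Users then create `EtcdCluster` objects instead of hand-managing StatefulSets, and the operator's controller handles the operational details.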

But operators have quickly become popular, and the sheer proliferation means that, ironically, Kubernetes ops teams now need to gain that same deep knowledge about operators themselves: choosing between operator frameworks, picking the right operator when an application has several to choose from, and managing the lifecycle of dependencies.

Simplifying the packaging, installation and operation of CRDs for cluster-wide services doesn’t remove the complexity of handling discovery, access control and dependencies for the operators themselves.

A Missing Concept

The operator concept was originally devised for stateful applications like databases, caches and monitoring systems where even with Kubernetes, getting scaling, upgrades and reconfiguration right without losing data or availability means knowing a lot about how those applications work.

The underlying problem, Patrick McFadin, Cassandra expert and vice president of developer relations at database provider DataStax, told The New Stack, is that while Kubernetes has always dealt with distributing application infrastructure, it wasn’t really designed for distributing data.

“Take service mesh. It’s the current hot topic and that’s a great example of enabling the application tier: how do you deploy all these microservices in a secure way and with discoverability.”

There hasn’t been the same exploration of how to enable the data tier, McFadin said. “When you’re creating microservices, it’s ‘well, you’ll probably connect to a database, you know what to do,’ and that’s left to the user to figure out. It’s just not part of the conversation. When you’re creating a deployment on Kubernetes, and you create your deployment YAML file, everything is very neat and orderly until you get to the distributed parts of it, like your storage and database, and then it gets pretty crazy.”

Operators are the current solution to the problem, but McFadin suggested, “Operators are becoming way too much of the solution.”

“There’s a real explosion in the number of operators,” Evan Powell, CEO of MayaData, told The New Stack. “Setting aside actually using the operators, the question for the user is which of these should I even start with? Maybe this is just where we are in Kubernetes adoption, as it’s used a lot for workloads like databases and logging that need operators, and it will settle out in time — but right now it’s confusing folks.”

One MayaData customer has 15 databases in production on Kubernetes, and each of them has five to ten operators associated with it.

“The operators are supposed to simplify my life, but am I supposed to look through 70 or 90 operators to figure out how to operate this complex environment? They’re a pretty sophisticated organization but they were asking ‘how do you even lifecycle an operator?’ The concept of operators is great, but now we have a gazillion of them; it’s like the old VM sprawl.”

While operators are usually associated with a single application, it’s not uncommon to run multiple instances of a database, each associated with a different operator. Unless deployments are carefully partitioned, Kubernetes admins may have to handle versioning and associations of CRDs in much the same way that DLL conflicts had to be managed on Windows Server.

“The way some of these operators are designed, [you have to ask] can they do their own clean-ups? And if they do, is that even the right pattern or do you want a more generic pattern?” Operators also complicate chaos engineering and some other approaches to testing for resilience. “If you kill the workload, does the operator go away? Does it kill the right one?”

There are also questions about where governance of operators falls in an organization, Powell suggested.

Operator functionality ranges from a basic install script to sophisticated logic that handles upgrades, backups and failures to full “autopilot” operations, so it’s important to choose between operators by functionality rather than just popularity. Individual operators can get more powerful over time: they tend to start out automating installation and self-service provisioning and often evolve to cover complex automation; so you might need two or three operators initially but be able to consolidate later — which means keeping an eye on the capabilities of operators and potentially changing which you use.

“One reason Kubernetes has done so well is that it implements and supports a certain division of concerns between the folks responsible for building apps, the platform team and operators,” Powell noted. “But if you’re not careful, operators pretty quickly create meetings, and the whole point of the loosely coupled stuff is we don’t need any stinking meetings! You’re now potentially coupling or at least forcing conversations between folks who have other things they would rather be doing.”

To avoid those meetings, platform teams may bless a few flavors of operators as part of validated workloads.

In the long run, Powell pointed out, the same resilience testing used on apps will need to apply to operators too. “You need to build operators in the right way, you need to know how to lifecycle them, but I think you also need systems that are trying to break them in production.”

Access and Control

One reason operators are becoming popular is that they’re a natural evolution of the principles in Kubernetes itself, Pulumi founder and CEO Joe Duffy told the New Stack. “It’s interesting because in a way it’s realizing the dream of Kubernetes which is lots of these just loose, independent, self-managing self-healing, long-lived processes. That is the Kubernetes design philosophy and so operators are sort of the natural realization of that when you start to think about extensibility. But with the explosion of operators, there are a lot of dependencies to manage, a lot of versions to manage.”

Reducing the overhead of working with operators requires tackling access and lifecycle issues. Duffy compared the early Helm component Tiller, which ran in the cluster namespace with broad privileges, to “an operator before operators existed,” leading to RBAC and security issues.

“One of the things we hear from our customers is that in the early days of operators people hadn’t figured out the RBAC story. Having the right approach for RBAC and namespacing can really help tame some of that complexity, because now you’re really segmenting who has to care about which of those 80 operators. Rather than having a sea of 80, you can actually have pockets of five operators per team.”
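That namespace segmentation can be expressed directly in RBAC. As a sketch (the names here are hypothetical), a namespace-scoped Role and RoleBinding confine an operator’s service account to one team’s namespace, so its CRD watches and workload changes can’t leak across teams:

```yaml
# Sketch: scope a database operator's permissions to the team-a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: db-operator
  namespace: team-a
rules:
  - apiGroups: ["example.com"]
    resources: ["etcdclusters"]   # the operator's own custom resource
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]   # what the operator manages on users' behalf
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: db-operator
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: db-operator
    namespace: team-a
roleRef:
  kind: Role
  name: db-operator
  apiGroup: rbac.authorization.k8s.io
```

A Role (rather than a ClusterRole) is what turns “a sea of 80 operators” into per-team pockets: each team’s binding only grants access inside its own namespace.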

Evolving Patterns and Definitions

The definition of an operator remains fluid; the Kubernetes App Delivery SIG has been discussing it for some months without a clear result, because this is an active area of development that’s still evolving.

There are multiple projects for creating and managing operators: KUDO (the ambitiously named Kubernetes Universal Declarative Operator); Metacontroller, along with the Metac project that started when development on Metacontroller stopped; and Red Hat’s Operator Framework, which has finally moved to the Cloud Native Computing Foundation as an incubating project; as well as projects with similar ideas like kubebuilder and Cluster Addon Operators. There are so many operators, and so many ways to deploy them (through Helm, KUDO, the Operator Framework’s OLM and raw Kubernetes manifests), that two of the handful of sample queries for Artifact Hub are specific categories of operators.

You’ll see the terms operator, controller and CRD used somewhat interchangeably, even though they are subtly distinct: a CRD defines a new resource type, a controller is the reconciliation loop that acts on resources, and an operator combines custom controllers and CRDs with application-specific operational knowledge.

Things are further complicated by the way operators have evolved beyond addressing stateful applications and are now getting used for very different concepts, including stateless applications that call external services.

Microsoft recently switched away from using the Open Service Broker API to provide a Service Catalog for Azure, and now offers an Azure Service Operator that customers can use to dynamically provision not just Azure storage, database and monitoring services like Cosmos DB, Event Hubs and Application Insights, but even virtual networks and VM scale sets, from within a Kubernetes cluster. The Amazon Web Services Service Operator is moving to become a set of controllers that enable you to manage AWS services from Kubernetes.
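In this model, a cloud resource is declared as just another Kubernetes object. A sketch of what that looks like with the Azure Service Operator (the API group, version and fields here are assumptions based on its early v1alpha1 releases, so treat them as illustrative):

```yaml
# Illustrative: provisioning an Azure resource group from inside a cluster.
# The Azure Service Operator's controller sees this object and calls the
# Azure APIs to create the real resource, then reports status back.
apiVersion: azure.microsoft.com/v1alpha1
kind: ResourceGroup
metadata:
  name: app-resources
spec:
  location: westus2   # Azure region for the resource group
```

The point is that external infrastructure, not just in-cluster workloads, becomes reconciled state: delete the object and the operator tears the cloud resource down.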

The fluidity of what an operator is and does allows for this kind of experimentation in ways that a more rigid definition might constrain too much. “The operator framework is really flexible,” McFadin pointed out. “If you look at the Operator SDK, it’s not very opinionated and it’s not meant to be. It’s meant to be a pretty general-purpose framework.”

But the same flexibility also leaves operators as an advanced area that can be confusing for the very people they’re meant to help.

“Operators are really on the bleeding edge,” Duffy pointed out. “We’ve got multiple years of learnings before we really understand what the best operator patterns are.”

Beyond Operators

One possibility is meta-operators that run operators. The Operator Lifecycle Manager in the Red Hat Operator Framework started out as a meta-operator for services like Prometheus and Vault in the CoreOS Tectonic Kubernetes distribution. KUDO offers a high-level approach to creating operators in YAML that allows KUDO itself to act as a Kubernetes operator for multiple applications.
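To give a sense of that declarative style, this is a sketch of a KUDO package’s operator.yaml (based on the v1beta1 package format; the service name and file names are hypothetical): instead of writing controller code, you describe tasks that apply templated manifests and plans that sequence them, and the single KUDO controller executes the plans.

```yaml
# Sketch of a KUDO operator package definition (v1beta1 package format).
apiVersion: kudo.dev/v1beta1
name: "my-service"          # hypothetical application name
operatorVersion: "0.1.0"
appVersion: "1.0.0"
tasks:
  - name: deploy            # a task applies templated resources
    kind: Apply
    spec:
      resources:
        - deployment.yaml   # template shipped alongside this file
plans:
  deploy:                   # plans sequence tasks into phases and steps
    strategy: serial
    phases:
      - name: main
        strategy: parallel
        steps:
          - name: everything
            tasks:
              - deploy
```

Because the operational logic lives in declarative plans rather than bespoke controller binaries, one KUDO installation can "lifecycle" many such packages.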

That might also address the lack of tools for human operators, the people with deep knowledge about running applications, to encapsulate that knowledge in a software operator when they want to make those applications container-ready and start orchestrating them with Kubernetes. “Even though they know that operators are the way to go, they don’t know how to easily build one with all the learning they’ve acquired over the years,” noted Amit Das, who maintains the Metac toolkit.

Another option is subsuming operators in other management tools. Pulumi now has an operator to enable deployments from within a cluster using GitOps workflows; it also enforces cloud policy as code, using native admission controllers to apply policy “on the way in,” and could replace some other operators. It’s built using Flux, a Kubernetes GitOps operator that can trigger actions based on git merges and commits.
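The Pulumi operator exposes that workflow as a Stack custom resource. A hedged sketch of what one looks like (field names reflect the operator’s early v1alpha1 API and the repo and stack names are made up, so treat this as illustrative rather than definitive):

```yaml
# Illustrative Stack resource: the Pulumi Kubernetes Operator watches the
# Git repo and runs `pulumi up` for the named stack when it changes.
apiVersion: pulumi.com/v1alpha1
kind: Stack
metadata:
  name: app-infra
spec:
  stack: acme/app-infra/prod                      # org/project/stack to deploy
  projectRepo: https://github.com/acme/app-infra  # Git repo with the Pulumi program
  branch: main                                    # branch to track (a commit SHA also works)
  accessTokenSecret: pulumi-api-secret            # Secret holding the Pulumi access token
```

One such resource can stand in for several per-application operators, since the Pulumi program it points at can manage databases, cloud services and cluster workloads together.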

“The Pulumi operator can manage all of your other tiers and your other operators, so maybe you need fewer of those other operators,” Duffy suggested. “Do you really need a Cassandra operator to manage the Cassandra infrastructure or do you just use the Cassandra provider for Pulumi that allows you to manage your Cassandra database? Rather than installing the AWS operator and the Azure operator, because they have their own prototype operators to manage infrastructure on those clouds, maybe standardise on the Pulumi operator to manage all those things and that can help bring down the explosion of operators.”

Higher-level services in Kubernetes itself may be another direction, along with changes that make applications better suited to running cloud native.

One reason there are more than ten, and possibly as many as 20, different operators to choose from just for Cassandra is that open source encourages developers to scratch their own itch. McFadin suggested that the best way for operators to improve in the short term would be for more Kubernetes users and admins to get involved in the projects, to get the features they need into a standard operator rather than writing their own.

But the other reason for the multiple Cassandra operators is that the database has “an enormous amount of surface area configuration, with almost a direct correlation to how much surface area configuration there is on the project and how many potential operators there are because every one is going to express a different part of that.”

The Apache Cassandra team is working on consolidating the dozens of options into a common operator that will be part of the main Cassandra distribution, encoding how it co-exists with Kubernetes. In the long term, that might even lead to data and distributed storage services becoming first-class citizens in Kubernetes, McFadin speculated, because the alternative is “pick one of the dozens of potential operators and good luck getting it right.”

Projects could also do more work to reduce the need for configuration, he suggested. “You should be able to deploy your product or your software with very little configuration and have it work in the optimum state.”

And for all the issues with the proliferation of operators, they’re a clear advance over needing deep domain expertise or dealing with raw APIs, Powell pointed out. “At least we’re in the same room talking about the best patterns to exploit what is actually a huge leap forward in interoperability or at least in consistency of APIs and approaches across systems.”

“Any large-scale industry that we create as humans, we make more efficient, we automate it in some way; we reduce the amount of people that needed to be involved for higher productivity. Operators are a step on the way there.”

Amazon Web Services, the Cloud Native Computing Foundation, DataStax, MayaData and Red Hat are sponsors of The New Stack.

Feature image by Hans Braxmeier from Pixabay.
