Kubernetes: When to Use, and When to Avoid, the Operator Pattern
In the world of Kubernetes, Operators have quickly become a popular pattern, far beyond their initial use for encoding deep operational knowledge about running stateful applications and services like Prometheus. But the complexities of CRD lifecycle management they bring with them mean that writing an operator will not always be the best solution for your own applications, because you’re creating more code to maintain.
I can’t disagree with this article enough. Writing operators is a very advanced use case that should be almost never recommended. Right now invest your IT teams time in GitOps, not operators. https://t.co/qPXx8G9pL0
— Darren Shepherd (@ibuildthecloud) August 18, 2020
Rancher Labs Chief Technology Officer Darren Shepherd strongly encourages developers to look at GitOps and other configuration management options before they consider writing their own operator, which he suggested to The New Stack should be reserved for only the most advanced use cases.
Operators are good for automating the knowledge of how to operate a complex system, but does it make sense to use “one-off orchestration logic” to manage an application, when there are existing patterns and primitives that you could use?
“Basically the only time you would really write an operator with custom orchestration logic is for persistent systems that are highly available: not a persistent application because that’s typically just talking to a database, but the actual persistent system like Cassandra that needs to be highly available and has custom logic on how to do quorums and failover and stuff like that,” Shepherd explained. “When you look at those use cases, an operator for Cassandra or MySQL or persistent system really makes sense. But if you look across the board, how many of those systems exist? And when they do, it’s the vendor or the project that should be writing the operator, not you. This is a very advanced concept that I don’t think should be very actively pursued by most people.”
There are two main problems Shepherd sees people trying to solve with operators where they’re not a good fit: to handle configuration management and to expose non-Kubernetes resources as Kubernetes objects.
“The first is, they actually have a configuration management problem. They’re like trying to simplify how they deploy, and manage applications, and they have it in their mind that somehow operators will solve that.”
That leads to people creating an operator that bundles Ansible or a Helm chart in an attempt to solve deployment management issues, when what they actually have is more of a traditional configuration management problem. Shepherd believes that problem is better solved by Helm (which Rancher is standardizing on for its 2.5 release), GitOps or Kustomize, or by more ambitious projects like the CUE data constraint language (a generic approach to automation and scripting).
It’s not that GitOps will let you do the same things an operator does, but that what an operator does won’t solve configuration management problems. “GitOps is completely different but it’s where your time is better spent because it’s most likely the solution to your problems at the moment is going to be in GitOps not in writing operators.”
As Rancher has started to tackle edge computing, which “amplifies the problem of multiple clusters,” he noted that “we’re realizing that we’re effectively just reinventing configuration management.”
“To have an operator says ‘I’m going to create something to embody all the complexities of operating an application’ and that just doesn’t make much sense because honestly, whose application is that complicated? What’s so complicated about your application that can’t just be described within the primitives that already exist? Fix your application, don’t try to bundle it in this operator.”
Applications Aren’t Services
The second misuse of operators is using them to expose non-Kubernetes applications as Kubernetes services.
“I think better paradigms will come along for how to do that than what operators are doing right now,” Shepherd said. “I’m looking at frameworks like CUE where I think we’ll see the ability to consolidate a bunch of primitives into a higher-level primitive without requiring a lot of custom programming and orchestration logic.”
Rancher’s experience writing controllers for Kubernetes has shown that there are some standard patterns. “They’re largely data in, data out,” Shepherd says.
“If you had a better configuration management like a configuration language, you could largely automate away a lot of the complexity of controllers and operators and whatnot. Because most of the time it’s ‘I want to take this input data, and then I want to render it with a bunch of existing data or produce new data, and then reconcile that state’; it’s a very common pattern.”
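The "data in, data out" pattern Shepherd describes can be sketched in a few lines of Python. This is a minimal illustration, not Rancher's code: the `render` and `reconcile` functions, the object keys and the `web` application are all hypothetical, and a real controller would read and write through the Kubernetes API rather than plain dictionaries.

```python
from typing import Dict, List, Tuple

Obj = Dict[str, object]


def render(app: Obj) -> Dict[str, Obj]:
    """Data in, data out: expand one high-level input into the objects it implies."""
    name = app["name"]
    return {
        f"Deployment/{name}": {"image": app["image"], "replicas": app["replicas"]},
        f"Service/{name}": {"port": app["port"]},
    }


def reconcile(desired: Dict[str, Obj], actual: Dict[str, Obj]) -> List[Tuple[str, str]]:
    """Return the (action, key) steps that would move `actual` to `desired`."""
    plan = []
    for key in sorted(desired):
        if key not in actual:
            plan.append(("create", key))
        elif actual[key] != desired[key]:
            plan.append(("update", key))
    for key in sorted(actual):
        if key not in desired:
            plan.append(("delete", key))
    return plan


# Hypothetical input and observed state, for illustration only.
app = {"name": "web", "image": "web:2.0", "replicas": 3, "port": 80}
actual = {
    "Deployment/web": {"image": "web:1.0", "replicas": 3},
    "ConfigMap/old": {"data": {}},
}
plan = reconcile(render(app), actual)
print(plan)
```

Nothing in the sketch is specific to one application, which is the point Shepherd is making: if the rendering step can be expressed as configuration, the reconcile step is generic.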
Operators don’t turn complex applications into something you can consume as a service, he noted. “It’s a little ideal to think that I’m going to give you this operator and it just automatically does everything for you. No, it’s just going to make it a little bit easier but you’re still operating a very complex system.”
Although there are a few use cases where operators are invaluable, they’re also a highly abused concept, and it’s not clear that people are thinking through the security implications before diving into writing operators, Alcide founder and Chief Technology Officer Gadi Naor told us. In the majority of cases, he would pass on building operators.
Operators perform tasks that become part of the platform infrastructure; that means a lot of moving parts that human operators need to understand clearly.
“Operators are a high-privilege component: by design, they run persistently inside your cluster and from a security standpoint, that’s introducing risk. Having many of them, which is the current situation in the ecosystem, introduces a lot of risks to your cluster. There are a lot of complexities around building an operator and getting it right and making it run fully autonomous in a way that is bulletproof; that’s a lot of heavy lifting. Why would I want to run an operator, which is a highly privileged component, either at the cluster level or namespace level, when it performs its intent only a very small fraction of the time?”
Needing an operator may be an indication that you’ve created too complex an architecture and Kubernetes isn’t the best solution for the application you’re building, Naor pointed out. You should also consider how it will be consumed.
“One test of whether you should write an operator or not would be if this is a one-to-one relationship where your operator is managing just one instance: there has to be a different way of building sophisticated highly privileged automation.”
“One option is that you can just run jobs or use existing components and tie them together, instead of building an operator that would do the same thing. Instead of writing your own operator, you can declaratively articulate what your operator should do and then a component that is more hardened and well designed can take care of the heavy lifting.”
“Think about what the lifecycle is that you think your operator should manage. If it’s something as generic as backup and restore, probably someone already did that in a generic way that would fit your application. If you’re going to build an operator that senses there’s a new microservice version available and you need to upgrade stuff, GitOps is a pretty generic way of achieving the same thing: you commit that there’s a new version and the GitOps agent will synchronize that to your target environment and everybody’s happy.”
Naor suggested KUDO as a declarative alternative for automating the deployment, installation and lifecycle of complex applications on Kubernetes. “You write in a declarative way something that is equivalent to an operator; not as full blown but it gives good coverage for the majority of use cases. And where you would have run many operators, instead you run just a single one that performs all the heavy lifting.”
Helm and Operators
KUDO bridges the gap between what Helm can do and building the entire dependency tree of an application, Naor explained.
Helm may even do what you need, Shepherd noted. “The Prometheus operator installs a Helm chart and then you get a bunch of CRDs to do things. All an operator is, is a set of controllers so why did I have to make this a first-class concept?” He suggested the pattern of using a Helm chart where the operator is just controllers inside that to deliver more types. “That makes more sense to me than writing an operator that’s a very specialized approach; Helm is a generic approach that applies to both my own and third-party applications. What is the value that I’m getting out of this one-off [operator]?”
While Helm maintainer Matt Butcher has previously rejected the idea of using Helm to perform “operator-like tasks,” more operators are stretching the original definition and acting as custom installers in ways that overlap with Helm. The project is currently discussing how Helm 4 will deal with the fragility of CRDs as cluster-wide shared global resources without sacrificing the usability of Helm, and there are questions about handling the operator pattern as part of this.
That discussion describes CRD handling as “the most intractable problem in Helm’s history” and suggests that the real problem is that “Kubernetes is not yet mature enough” for Helm to be able to deliver “robust support” for CRDs. There’s also concern that needing to understand CRD management to consume a Helm chart safely would change the current assumption of the project that Helm users should not need significant Kubernetes knowledge.
Another automation option Naor suggests considering is the OpenKruise project. This project expands on some shortcomings in Kubernetes’ basic scheduling constructs, using the notion of advanced stateful set, an enhanced version of the Kubernetes default stateful set for building stateful services on Kubernetes. Another idea, the sidecar set is a declarative way of injecting sidecars.
“Open source projects building technologies based on sidecars were implementing the same thing over and over again, which is a mutating admission controller that injects sidecars. So they built something that you install once and then declaratively say what to inject and where to inject,” Naor said.
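The "say what to inject and where" idea can be sketched as a pure function over pod definitions. This is an illustrative sketch only, not the OpenKruise API: the rule format (a label `selector` plus a list of `containers`), the `inject_sidecars` function and the `log-agent` image are assumptions made for the example, and a real implementation would run as a mutating admission webhook.

```python
import copy
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every selector label is present on the pod with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


def inject_sidecars(pod: dict, rules: List[dict]) -> dict:
    """Return a copy of `pod` with the sidecars from every matching rule appended."""
    patched = copy.deepcopy(pod)
    labels = patched.get("metadata", {}).get("labels", {})
    containers = patched["spec"]["containers"]
    existing = {c["name"] for c in containers}
    for rule in rules:
        if not matches(rule["selector"], labels):
            continue
        for sidecar in rule["containers"]:
            # Skip sidecars that are already present so injection is idempotent.
            if sidecar["name"] not in existing:
                containers.append(copy.deepcopy(sidecar))
                existing.add(sidecar["name"])
    return patched


# Hypothetical rule and pod, for illustration only.
rules = [{
    "selector": {"app": "web"},
    "containers": [{"name": "log-agent", "image": "example/log-agent:1.0"}],
}]
pod = {
    "metadata": {"labels": {"app": "web"}},
    "spec": {"containers": [{"name": "web", "image": "web:2.0"}]},
}
patched = inject_sidecars(pod, rules)
print([c["name"] for c in patched["spec"]["containers"]])
```

Because the rules are data, one installed component can serve every project that needs a sidecar, which is the consolidation Naor describes.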
Karl Isenberg, who previously worked with the KUDO team at Mesosphere and was technical lead manager of the PaaS team at Cruise Automation that built Isopod (a YAML-free tool to help manage common resources across clusters before multi-cluster addon management became more sophisticated), has compiled a useful taxonomy of Kubernetes app deployment tools that includes both operators and alternatives.
“The ecosystem is still trying to work out the best way to manage Kubernetes workloads: there’s a lot of options and they all have pros and cons,” he told us. “With so many ingress and service mesh options around, there’s no standard to build deployment automation on top of, so solutions tend to be non-transferable and limited to a custom stack.”
He noted that similar patterns to operators were found in Mesos, where the requirement for applications and services to have a custom workload scheduler as part of the two-level scheduler led to a mix of generic schedulers for stateless applications and specific schedulers for complex distributed systems like Spark and Cassandra that needed lifecycle management. “Operators built on that. But after a while, they took over and everybody was writing operators to handle custom lifecycles, partially because it became harder to upstream feature changes into the Kubernetes scheduler.”
Improvements to the Kubernetes scheduler may obviate the need for operators in some cases, Isenberg suggested: “When operators emerged people were using their own custom controllers and operators to manage the workflow or lifecycle of their application because they couldn’t customize the scheduler or plug in a custom scheduler. It’s now possible to set annotations on your workload so the primary scheduler ignores it and it gets scheduled by a custom scheduler.”
But custom schedulers are still rare: “I only know of two that aren’t just operators,” he noted.
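As a rough illustration of handing a workload to a custom scheduler: in current Kubernetes releases the opt-out is expressed with the pod's `spec.schedulerName` field (an annotation was used for this in early alpha versions). The sketch below only builds the manifest as a Python dictionary; the `pod_manifest` helper and the scheduler name `my-custom-scheduler` are hypothetical.

```python
from typing import Optional


def pod_manifest(name: str, image: str, scheduler: Optional[str] = None) -> dict:
    """Build a minimal Pod manifest, optionally assigned to a non-default scheduler."""
    spec = {"containers": [{"name": name, "image": image}]}
    if scheduler is not None:
        # The default scheduler skips pods whose schedulerName names another scheduler.
        spec["schedulerName"] = scheduler
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


default_pod = pod_manifest("web", "web:2.0")
custom_pod = pod_manifest("batch", "batch:1.0", scheduler="my-custom-scheduler")
print(custom_pod["spec"]["schedulerName"])
```

A pod built this way stays pending until a scheduler with that name is actually running in the cluster, so the field is only useful alongside a deployed custom scheduler.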
Sets and Scale
Operators are helpful for workloads with “data weight,” Isenberg suggested. “Data has gravity and it might take more than 30 seconds to evacuate a node, or it might require something more complicated than stateful sets, which were designed for the etcd pattern where you have named nodes and those names aren’t supposed to change, they’re just supposed to come back up. That’s a design that predates cloud native, for bare metal or VMs that didn’t autoscale. Those pre-cloud native deployment designs needed to be adapted to Kubernetes and so people did that a lot with operators.”
If you’re writing a stateless application, you don’t need to write an operator at all. “But nobody just writes a stateless application anymore; everybody’s writing complex microservice distributed systems, and if you have any sort of data pipeline or streaming, you end up with dozens of services and many of those you pull off the shelf and they might come with their own operator because the way somebody who had a more complicated product adapted that to Kubernetes was with an operator.”
Sometimes, what you think is an operator is really a controller, he noted. “The API was designed to be used both declaratively and imperatively, and operators are an imperative way to use Kubernetes. Otherwise, most imperative use is either direct API usage with a controller that doesn’t have its own CRD, or just scripting and your CI/CD workflow. Because more people are using Kubernetes, more people are writing code to exercise the imperative workflows and most of them are calling those operators because they end up with a CRD in them. But sometimes there’s a little bit of name conflation with controllers and they mistakenly get called operators just because it’s trendy.”
Operators are fundamentally a way to get around the lack of dependency management or resource hierarchies in Kubernetes, Isenberg suggested (echoing much of the Helm discussion about CRDs), and neither of those is a problem that will be solved quickly, so sometimes an operator is the answer.
“Writing an operator isn’t as easy as making a deployment but if a deployment doesn’t work for you and a stateful set doesn’t work for you, it’s not a problem to write an operator. But you’re building a piece of software that you’re going to have to maintain indefinitely and not just a configuration that you can tweak and replace. The more code you write, the more locked into that platform you are. So when you write operators, you’re locked into Kubernetes, whereas a deployment is like a configuration file; you could just throw away the deployment and take your service somewhere else. GitOps is definitely more portable, writing Terraform is more portable.”
Because writing operators commits you to maintaining the code, you will also have to commit to hiring more developers and a Kubernetes platform team who have skills with operators in the future, to make sure your use of Kubernetes is robust and secure.
Operators are the ‘break glass’ option for when the resources in Kubernetes aren’t expressive enough for what you need to do, Isenberg noted. But if you’re adapting applications and workloads to run on Kubernetes and operators aren’t already available for them, it’s possible Kubernetes isn’t the best place to run them.
“Everybody wants to maximize their investment in Kubernetes but I feel there’s a little bit of a sunk cost fallacy that goes on. Just because you can run something on Kubernetes doesn’t mean you should. There are some applications that are just easier to manage on VMs because they were built in an era where that’s how they were designed to be deployed and managed.”
Isolate Operators with Side Clusters
If you do decide you need to write operators to do what you need, “You should really think about having something that reduces privileges for the duration that the operator is idle,” Naor suggested.
“Think about having something that externally reprovisions those permissions so that you don’t have something highly privileged running all the time. Or maybe running those operators outside of the cluster and disconnecting and reconnecting them, rather than running them forever. From a security standpoint, it makes more sense to not have everything running in the same place, because if the operator is compromised, then potentially the entire cluster can be compromised.”
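One way to picture externally reprovisioned permissions is a small generator that emits a narrower RBAC Role while the operator is idle and a wider one only while it has work to do. This is a sketch under stated assumptions, not a tool Naor described: the `operator_role` helper, the role name `example-operator` and the choice of the `apps` API group are hypothetical, and something outside the cluster would still have to apply the resulting manifest.

```python
from typing import Iterable, List

READ_ONLY: List[str] = ["get", "list", "watch"]
READ_WRITE: List[str] = READ_ONLY + ["create", "update", "patch", "delete"]


def operator_role(namespace: str, resources: Iterable[str], active: bool) -> dict:
    """Build a namespaced Role: read-only while idle, read-write while active."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        # Hypothetical role name, for illustration only.
        "metadata": {"name": "example-operator", "namespace": namespace},
        "rules": [{
            "apiGroups": ["apps"],
            "resources": list(resources),
            "verbs": list(READ_WRITE if active else READ_ONLY),
        }],
    }


idle_role = operator_role("prod", ["statefulsets"], active=False)
active_role = operator_role("prod", ["statefulsets"], active=True)
print(idle_role["rules"][0]["verbs"])
```

Keeping the idle role read-only lets the operator keep watching for changes while limiting what a compromised operator could do in the meantime.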
Side clusters — a term he coined in analogy to the sidecar pattern — are particularly useful for security and monitoring services and could also improve the security and privilege concerns with operators.
“The recommended way to monitor the Kubernetes audit log is not from inside the cluster, but rather from an external security operations cluster that connects to the target cluster and performs the monitoring, which means that the cluster is more resilient to a threat actor trying to neutralize your security.
“With Prometheus, if you’re running inside the cluster, then the Prometheus instance is susceptible to the cluster conditions, and if the cluster became unstable it would destabilize your monitoring system. This is why you have projects like Thanos and others that are trying to push the monitoring piece outside of the cluster and then just scrape the cluster from the outside world.”
From a security standpoint, you have similar sets of challenges. “With some of our customers, where they are running multiple clusters, we are building side clusters that are responsible for operating, in our case, security-related tasks that need to run or be kept outside of the application clusters. If I’m running my operators on a side cluster and regulating their permissions on the primary cluster, that can improve the security side of things, though not necessarily the overhead of managing many operators,” Naor said.
And if the namespace and RBAC considerations of hosting operators outside the cluster seem complex, it’s probably another sign that you should take a different approach from operators.