We Pushed Helm to the Limit, then Built a Kubernetes Operator
K8ssandra is a distribution of Apache Cassandra® on Kubernetes, built from multiple open source components. From the beginning and continuing through the most recent K8ssandra 1.3 release, K8ssandra has been installed and managed as a collection of Helm charts. While the project has leveraged Kubernetes operators for components, including Cassandra (cass-operator) and Medusa (medusa-operator), there hasn’t been an operator to manage all these components as a holistic system.
The K8ssandra team recently finalized a decision we had debated for months: to create an operator for the K8ssandra project. In this article, we present our experience using Helm, our decision to create an operator for K8ssandra, and what we hope this will accomplish for the project.
How It Started
The core of K8ssandra is cass-operator, which we use to deploy Cassandra nodes. Around this, we added an ecosystem of components for operating Cassandra effectively in Kubernetes, including operational tools for Cassandra for managing anti-entropy repair (Reaper) and backups (Medusa). We include the Prometheus-Grafana stack for metrics collection and reporting. Stargate is a data gateway that provides more flexible access to Cassandra via REST, GraphQL and document APIs.
At the start, we used Helm to help manage the installation and configuration of these components. This enabled us to quickly bootstrap the project and begin building a community. Most of the initial interest in the project came from developers in the Cassandra community who didn’t necessarily have much Kubernetes expertise and experience. Many of these folks found it easier to grasp a package management tool and installer like Helm than an operator and custom resource definitions (CRDs). That’s not to say that Helm is for the “less Kubernetes savvy,” because a big part of the Kubernetes ecosystem uses Helm.
How It’s Going: Ups and Downs with Helm
As the project grew, we began to run into some limitations with Helm. While it was pretty straightforward to get the installation of K8ssandra clusters working correctly, we encountered more issues when it came to upgrading and managing clusters.
Writing Complex Logic
Helm has good support for control flow, with loops and if statements. However, when you start getting multiple levels deep, it’s harder to read and reason through the code, and indentation becomes an issue. In particular, we found that peer-reviewing changes to Helm charts became quite difficult.
Reuse and Extensibility
Helm variables are limited to the scope of the template where you declare them. For example, we had a variable defined in the Cassandra data center template that we wanted to reuse in the Stargate template, but that wasn’t possible. We had to recreate the same variable in the Stargate template. This prevented us from keeping our code DRY (don’t repeat yourself), and the resulting duplication became a source of defects.
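Because Helm templates are Go text/template templates under the hood, the scoping rule can be reproduced in a few lines of plain Go. The sketch below is only an illustration: the template names cassdc and stargate and the variable $dcName are hypothetical, and it uses define blocks in a single string rather than the separate chart files we actually had.

```go
package main

import (
	"fmt"
	"text/template"
)

// parseErr parses src as a Go template and returns the parse error, if any.
func parseErr(src string) error {
	_, err := template.New("chart").Parse(src)
	return err
}

// A variable declared inside one named template is not visible in another,
// so the "stargate" template below refers to an undefined variable.
const scoped = `
{{define "cassdc"}}{{$dcName := "dc1"}}datacenter: {{$dcName}}{{end}}
{{define "stargate"}}datacenter: {{$dcName}}{{end}}
`

// The workaround is to declare the same variable again (hypothetical values).
const duplicated = `
{{define "cassdc"}}{{$dcName := "dc1"}}datacenter: {{$dcName}}{{end}}
{{define "stargate"}}{{$dcName := "dc1"}}datacenter: {{$dcName}}{{end}}
`

func main() {
	// The first call reports an error; the second parses cleanly.
	fmt.Println("shared variable:    ", parseErr(scoped))
	fmt.Println("duplicated variable:", parseErr(duplicated))
}
```

The second template only parses because the value is written twice, which is exactly the kind of duplication that can drift apart over time.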
Similarly, Helm has a nice big library of helper template functions, but that library doesn’t cover every use case, and there is no interface to define your own functions. You can define your own custom templates, which allow for a lot of reuse, but those are not a replacement for functions.
Project Structure and Inheritance
We also ran into difficulties when we tried to implement an umbrella chart design pattern, which is a best practice for Helm. We were able to create a top-level K8ssandra Helm chart with sub-charts for Cassandra and Prometheus, but ran into problems with variable scoping when attempting to create additional sub-charts for Reaper and Stargate. Our intent was to define authentication settings in a single location, the top-level chart, so they could apply not just to Cassandra, but also to Stargate and Reaper. This concept of pushing variables down to sub-charts is not supported by the Helm inheritance model.
Helm can install Kubernetes CRDs, but it doesn’t upgrade or otherwise manage them afterward. We understand that this was a deliberate design choice the Helm developers made for Helm 3. Because the definition of a custom resource is cluster-wide, it can get confusing if multiple Helm installs are trying to work off of different versions of a CRD. However, this presented us with some difficulties. To manage updates to resources like a Cassandra data center within Helm, we had to implement a workaround. We implemented custom Kubernetes jobs and labeled them as pre-upgrade hooks, so Helm would execute them on an upgrade. Each job was written in Go and packaged into an image. This is essentially like writing mini-controllers, and at some point it began to feel like writing an operator.
The Breaking Point: Multicluster Deployments
While we’ve been able to work around these Helm challenges through the 1.3 release, the next major feature on our roadmap was implementing multicluster K8ssandra deployments (K8ssandra/Cassandra clusters that spanned multiple Kubernetes clusters). We realized that even without the intricacies of the network configuration, this was going to be a step beyond what we could implement effectively using Helm.
Setting a New Direction
In the end, we realized that we were trying to make Helm do too much. It’s easy to get into a situation where you learn how to use the hammer and everything looks like a nail, but what you really need is a screwdriver.
As it turns out, we found some common ground with the creators of the Operator Framework, who have defined a capability model for operators, which we highlight here:
As described in this figure, Helm is most suited for the first two levels of operator functionality, focusing on simple installation and upgrades. More complex operations like failure handling, recovery and autoscaling (plus, we would argue, more complex installation and upgrades) should be implemented with an automation tool such as Ansible or a programming language such as Go, rather than a templating language like Helm.
Building an Operator: K8ssandra 2.0
Based on this analysis, the team decided it was time to start building an operator. We’re calling this the K8ssandra 2.x series of releases. The priorities for the 2.0 release are porting over the existing functionality that we have in the Helm charts, making sure the operator has feature parity and adding multicluster support. We still intend to address bugs or gaps in the 1.x release stream, but we’re trying to focus any major new feature work on the operator.
There’s Still a Place for Helm
In terms of tooling, we don’t see Helm and operators as mutually exclusive. These are complementary approaches, and we need to use each one for its strengths. We’ll continue to use Helm to perform basic installation actions, including installing operators and setting up the administrator service account used by Cassandra and other components. These are the sorts of actions that package managers like Helm do best.
Operator Design and Implementation Choices
There are several key choices we’ve made in the design and implementation of the K8ssandra operator.
While we have separate repositories for Reaper Operator, Medusa Operator and Stargate Operator, we do plan to consolidate those into the K8ssandra operator. The K8ssandra operator will run in a single pod, but will consist of multiple controllers, one for each of its CRDs. Because cass-operator is already used independently, it will continue to be independent and will be a dependency pulled into the K8ssandra operator.
While this is not currently a microservice architecture, it is decoupled and modular, so we could decide to repackage the controllers as separate microservices in the future if needed.
Implementation in Go Using the Operator SDK
We decided to write the K8ssandra operator in Go, using the Operator SDK. This was an easy choice for us, as we were already familiar with it from cass-operator. We believe that working in a full programming language like Go will be more appealing than working on YAML templates and will help attract new contributors to the project. Coding in Go will enable us to use the full arsenal of the language. For example, Go makes it easy to create helper functions that we can easily reuse.
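As a small illustration of that kind of reuse, here is a minimal sketch of a helper that more than one controller could call. The function name commonLabels and the label values are hypothetical and are not taken from the K8ssandra code base; the label keys follow the recommended Kubernetes app.kubernetes.io convention.

```go
package main

import "fmt"

// commonLabels is a hypothetical helper that builds the labels every
// component of a cluster should carry. Because it is an ordinary function,
// any controller can call it instead of repeating the values.
func commonLabels(cluster, component string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/managed-by": "k8ssandra-operator", // assumed value
		"app.kubernetes.io/part-of":    cluster,
		"app.kubernetes.io/component":  component,
	}
}

func main() {
	// The same helper serves both the Stargate and the Reaper code paths.
	for _, component := range []string{"stargate", "reaper"} {
		fmt.Println(component, commonLabels("demo", component))
	}
}
```

Contrast this with the Helm variable we had to declare once per template: here the shared logic lives in one place and can be unit tested.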
K8ssandra Cluster-Level Status
The new K8ssandra cluster CRD has a status field to give you an overview of the state of the cluster — whether it’s ready, not ready, initializing, and so forth. This status will summarize the health of all the objects that make up the cluster: the Cassandra cluster, Stargate, Reaper and anything else deployed as part of it. This is not something you can do with Helm.
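To make the idea concrete, here is one way such a summary could be computed. This is a sketch under our own assumptions, not the operator’s actual status logic: the ComponentState type, its three values and the summarize function are all hypothetical.

```go
package main

import "fmt"

// ComponentState is a hypothetical per-component state.
type ComponentState string

const (
	StateReady        ComponentState = "Ready"
	StateInitializing ComponentState = "Initializing"
	StateNotReady     ComponentState = "NotReady"
)

// summarize reduces the component states to a single cluster-level state.
// Any component that is neither Ready nor Initializing makes the cluster
// NotReady; otherwise any Initializing component makes it Initializing.
// An empty map is reported as Ready.
func summarize(components map[string]ComponentState) ComponentState {
	overall := StateReady
	for _, state := range components {
		switch state {
		case StateReady:
			// Nothing to do; keep the current overall state.
		case StateInitializing:
			overall = StateInitializing
		default:
			return StateNotReady
		}
	}
	return overall
}

func main() {
	cluster := map[string]ComponentState{
		"cassandra": StateReady,
		"stargate":  StateInitializing,
		"reaper":    StateReady,
	}
	fmt.Println("cluster status:", summarize(cluster))
}
```

A real status field would likely carry more detail, such as per-component conditions, but the principle is the same: one controller has visibility into all the pieces and can report on them together.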
Stronger Alignment With the Kubernetes Way
Our design approach with controllers for each custom resource is much more aligned with the standard way of managing resources in Kubernetes. For example, we have a particular startup sequence we want to enforce: Stargate should not start until Cassandra is initialized. With Helm out of the box, there’s no way to do that. We had to add an init container in the Stargate pod that performs a rudimentary check that the cluster is up and running. With the new operator design, the Stargate controller watches for status changes on Cassandra data center resources. When a change triggers its reconciliation, it queries the state of the Cassandra data center, and once the data center is ready, the operator creates the Stargate deployment.
This will also improve testing. There are a lot of test coverage tools out there. For example, we’re using SonarCloud. However, we can’t use SonarCloud with Helm templates, so we don’t really have a good way to measure the level of coverage we have in our tests right now. Helm templates also don’t get the same level of IDE support that you would have for a statically typed language.
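The startup ordering can be sketched in a few lines of Go. This is a deliberately simplified model, not the operator’s code: Datacenter and StargateReconciler are hypothetical stand-ins that avoid the Kubernetes client libraries, and recording a name in a map stands in for creating the real deployment.

```go
package main

import "fmt"

// Datacenter is a hypothetical stand-in for the Cassandra data center
// resource; only the fields needed for the ordering check are modeled.
type Datacenter struct {
	Name  string
	Ready bool
}

// StargateReconciler is a hypothetical controller. The deployed map records
// the data centers for which a Stargate deployment has been created.
type StargateReconciler struct {
	deployed map[string]bool
}

// Reconcile is meant to be called whenever the watched data center's status
// changes. It reports whether a Stargate deployment exists after the call.
func (r *StargateReconciler) Reconcile(dc Datacenter) bool {
	if !dc.Ready {
		// Cassandra is not initialized yet: do nothing and wait for the
		// next status change to trigger another reconciliation.
		return false
	}
	// Stand-in for creating the deployment; repeating it is harmless.
	r.deployed[dc.Name] = true
	return true
}

func main() {
	r := &StargateReconciler{deployed: map[string]bool{}}
	fmt.Println("while initializing:", r.Reconcile(Datacenter{Name: "dc1"}))
	fmt.Println("once ready:        ", r.Reconcile(Datacenter{Name: "dc1", Ready: true}))
}
```

The point of the sketch is that the ordering rule becomes ordinary, testable code instead of an init container polling from inside the pod.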
Things We’re Still Figuring Out
As we work on developing the operator, there are a few areas in which we’re continuing to explore and learn.
Speeding up Iterative Development
Working with Helm templates is really nice for iterating quickly, but the development steps for an operator are more complex. After modifying operator code, we have to rebuild the operator image and deploy it, then deploy the custom resource that the operator manages so it will then generate the underlying deployment object. Then we can verify the deployment. This process involves more steps, so we’re looking to improve our automation.
Multicluster Integration Testing
Testing multicluster K8ssandra deployments presents some challenges. Up to this point we’ve been able to do most of our continuous integration testing with GitHub Actions, using the free tier runners, but we’ve found this insufficient in terms of resources for multicluster.
One tool that we’re looking at for integration tests is Kuttl. With Kuttl, both the test cases and expected results are described in YAML files, which means that you don’t have to be an expert in Go or the Kubernetes API to contribute tests. We believe this could make it simpler for developers to get involved in testing and make contributions right away, then ramp up on Go at their own pace if they want to.
Should You Use an Operator? Should You Write an Operator?
If you’ve read this far, you’re probably wondering about the implications for your own projects. If you’re using databases or other infrastructure in Kubernetes, it definitely makes sense to use operators to automate as much of your operational workload as possible.
If you’re working for a data infrastructure vendor or contributing to an open source data infrastructure project, you may be wondering how you’ll know when it’s time to invest in building an operator. We put a lot of thought into our own transition, especially in terms of the timing and impact on our users. Ultimately the rule we’d recommend is this: If you find yourself dealing with multiple situations where your tooling is working against you and not for you, then maybe it’s time to consider a different solution.
Building the Community
We’re seeing an increase in contributions to K8ssandra right now, especially in terms of issue creation. Now that we’ve started to pick up momentum with the operator development, it’s a huge bonus to have a growing user community to help us recognize the things that we need to speed up that maturation process.
We want to continue to build the team of those contributing code as well. If you’re interested in running Cassandra on Kubernetes or building operators, we’d love to have you as part of the K8ssandra project. Check out the website and ask any questions you might have on the Forums or our Discord server.