Writing Custom Kubernetes Controllers: Beware of State Drift
Kubernetes has already become successful for what it does, orchestrating containers, but it’s also influencing cloud computing because of how it does it.
The design pattern of its control plane, a declarative API that lets users express their desired state plus a set of control loops (also known as controllers) that drive the “real world” toward that desired state, is proving general enough to manage much more than containers.
A symptom of this is the sprawl of custom Kubernetes controllers and APIs written by the community to manage all sorts of resources, such as virtual machines, data services, software-defined networks and many more.
However, while writing a toy custom controller is relatively easy, there are many challenges when writing a production-grade one.
At anynines, we are building a whole control plane based on custom Kubernetes controllers to fully automate the life-cycle management of data services. In doing so, we became aware of many challenges associated with production-grade controllers. This article describes one challenge we rarely see discussed: The “real world” can autonomously deviate from the users’ desired state, so it must be watched by controllers just like the Kubernetes API.
The following section describes the problem. Then, we outline possible solutions before summarizing everything in the conclusion.
The Problem

The usual definition of a Kubernetes controller, custom or not, is: “a process that watches notifications on certain (custom) Kubernetes API objects and processes each such object by modifying the resource described by the object, to make the resource match that object’s desired state.” The act of making the resource described by an API object match the desired state is sometimes called reconciliation.
What resource(s) a controller modifies to reconcile an API object depends on what the API object represents, so it’s different for every controller. Some controllers reconcile API objects by creating, updating and/or deleting other, dependent API objects. This translation of API objects into lower-level API objects can go on for multiple levels, but at some point it stops with a controller that reconciles API objects by applying side effects to something which is not the Kubernetes API.
For example, if you create a StatefulSet API object, it gets translated into some pod API objects, and then for each pod, a container is started; the containers are a side effect outside of the Kubernetes API.
According to the definition at the beginning of this section, all the work a controller does is driven by notifications on API objects: If no new notifications are received because there are no changes to the population of API objects, a controller does nothing. Indeed, this is how many controllers are written. If the controller is one that only creates/updates/deletes API objects and doesn’t directly modify anything outside of the Kubernetes API, there are no issues with this.
If the controller modifies some resources outside of the Kubernetes API, then such a design might be flawed. The reason is that it ignores the fact that the state of those resources — the “real” state — might autonomously drift away from the desired state as described in the API object. In such a scenario, the controller won’t receive any notification, so it can’t drive the state of the resource back to the desired state, and the system is not self-healing.
Let’s look at an example, shown in Figure 1. Imagine we have a controller responsible for directly managing containerized apps. (Of course, you wouldn’t write such a custom controller, because Kubernetes itself already solves this problem, but bear with us for the sake of the example.) If an API object describing a new app with one replica is created, the controller spins up a container running the app. If the container crashes because of a bug in its code, bad inputs, etc., the controller will never receive a notification on any API object. Yet it must learn that the container crashed so it can spin up a replacement as soon as possible and self-heal!
So for a custom Kubernetes controller that directly modifies resources outside of the Kubernetes API, it’s not enough to subscribe to notifications on Kubernetes API objects. It must also monitor the resources that back those API objects, and reconciliation between desired and real state must be triggered even when the real state drifts away, as shown in Figure 2. There’s also one exception where a controller that only deals with API objects is subject to the same issue: when the controller modifies a resource outside of the Kubernetes API not directly but via Kubernetes Jobs.
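To make the shape of such a controller concrete, here is a minimal sketch in Python (real controllers are typically written in Go; every name below is a hypothetical in-memory stand-in, not a real client library). The key point is that a single work queue is fed by two event sources: API-object notifications and real-state events.

```python
import queue

# Hypothetical in-memory stand-ins for the Kubernetes API and the real world.
desired = {"app-1": {"replicas": 1}}   # desired state, as stored in API objects
real = {"app-1": {"replicas": 1}}      # real state, e.g. running containers

events = queue.Queue()  # single work queue fed by BOTH event sources

def on_api_notification(name):
    """Called when the API object changes (create/update/delete)."""
    events.put(name)

def on_real_state_event(name):
    """Called when the real state drifts, e.g. a container crashes."""
    events.put(name)

def reconcile(name):
    """Drive the real state toward the desired state for one object."""
    spec = desired.get(name)
    if spec is not None and real.get(name) != spec:
        real[name] = dict(spec)  # e.g. (re)start containers

def run_once():
    """Process all pending events (a real controller loops forever)."""
    while not events.empty():
        reconcile(events.get())

# Simulate drift: the container crashes, and no API notification is emitted.
real["app-1"]["replicas"] = 0
on_real_state_event("app-1")  # the controller must learn about this itself
run_once()                    # real state is driven back to the desired state
```

A controller that only subscribed to `on_api_notification` would stay idle here and never repair the crash; the second event source is what makes the loop self-healing.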
If you’re building a simple custom controller, you can likely get away with just translating the reconciled API objects into dependent, Kubernetes built-in API objects, so the problem we described doesn’t affect you. However, if you’re building a whole control plane based on custom controllers that must support complex use cases, you’ll likely have to write a controller that modifies something outside the Kubernetes API, and then the problem can appear.
Unfortunately, there can be no general solution, as it greatly depends on the nature of the resources described by the custom API objects. However, in the next section, we sketch some ideas that can act as foundations to build solutions.
Ideas for Solutions
Avoid the Problem Altogether
If the custom controller can accomplish its task only by creating/updating/deleting dependent Kubernetes API objects, take this approach; it’s simpler and avoids the problem altogether. The controller must still monitor the state backing the API objects it implements, but because that state consists of other API objects, the controller can reuse the same Kubernetes watch mechanisms it already uses to be notified about the API objects it implements, so Kubernetes solves the problem automatically.
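The mechanism can be pictured as an event router: every notification on a dependent object is mapped back to the key of the object that owns it, so drift in the dependents re-enqueues the owner for reconciliation (Kubernetes expresses the mapping through ownerReferences; Go's controller-runtime exposes it as `Owns()`). A hypothetical, minimal Python illustration:

```python
# Hypothetical sketch: when a controller only manages dependent API objects,
# drift detection comes for free by watching the dependents and mapping each
# dependent-object event back to the owning object's key.

owners = {"pod-a": "app-1", "pod-b": "app-1"}  # dependent -> owner

work_queue = []  # owners waiting to be reconciled (deduplicated)

def enqueue_owner(dependent_name):
    """Map a notification on a dependent object to a reconcile of its owner."""
    owner = owners.get(dependent_name)
    if owner is not None and owner not in work_queue:
        work_queue.append(owner)

# A deleted or modified dependent re-triggers reconciliation of its owner,
# and multiple dependent events collapse into one unit of work:
enqueue_owner("pod-a")
enqueue_owner("pod-b")
```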
Periodically Re-reconcile the API Object Idempotently
If the actions that modify the resources outside of the Kubernetes API are idempotent, the controller can periodically re-reconcile each API object even if no new notification is received (Kubernetes has built-in support for this) and re-execute the idempotent actions. If the real state has drifted away from the desired one, the next reconciliation fixes it; otherwise, nothing changes, since the applied actions are idempotent, as shown in Figure 3.
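As an illustration, here is a hedged Python sketch of an idempotent, “ensure”-style reconciliation combined with a periodic resync (in Go's controller-runtime, the periodic part is usually expressed by returning a `Result` with `RequeueAfter` set). All state here is a hypothetical in-memory stand-in:

```python
# Hypothetical desired and real state, e.g. user accounts in some backend.
desired = {"users": ["alice", "bob"]}
real = {"users": ["alice", "bob"]}

def ensure(spec, state):
    """Idempotent action: applying it twice has the same effect as once."""
    for user in spec["users"]:
        if user not in state["users"]:
            state["users"].append(user)  # e.g. CREATE ROLE in a database

def periodic_resync(ticks):
    """Stand-in for a timer-driven resync; each tick re-runs the actions."""
    for _ in range(ticks):
        ensure(desired, real)

# Out-of-band drift: someone deletes a user behind the controller's back.
real["users"].remove("bob")
# The next resync repairs it; the extra resyncs are harmless no-ops.
periodic_resync(3)
```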
This approach wastes resources, because reconciliations run periodically even when none are needed, and it raises the question of how to tune the reconciliation period.
Poll Real State
If checking the real state of the resource backing an API object is already part of the reconciliation, as it should be, and the reconciliation doesn’t consume too many resources, the controller can periodically re-reconcile each API object. The standard reconciliation logic will poll the real state and correct it if and only if necessary, as shown in Figure 4.
Otherwise, if a complete reconciliation eats too many resources, you can write code that does nothing but periodically poll the real state of each API object and trigger a reconciliation if and only if the real state differs from the desired one, as shown in Figure 5. Such code would run inside the controller process, but it would be separate from the main control loop that fully reconciles API objects.
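A minimal sketch of this second variant, using hypothetical in-memory state: a cheap poll compares real and desired state and triggers the (expensive) full reconciliation only on a mismatch.

```python
# Hypothetical state: an access role managed outside the Kubernetes API.
desired = {"db-role": "reader"}
real = {"db-role": "reader"}

reconcile_count = 0  # lets us observe how often the expensive path runs

def reconcile():
    """The full (expensive) reconciliation of the API object."""
    global reconcile_count
    reconcile_count += 1
    real["db-role"] = desired["db-role"]

def poll_once():
    """Cheap check; trigger a full reconciliation only when state differs."""
    if real != desired:
        reconcile()

# No drift: polling is cheap and triggers nothing.
poll_once()
# Drift: the poll detects it and triggers exactly one reconciliation.
real["db-role"] = "none"
poll_once()
```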
The problems with these two approaches are:
- resource consumption (there could be a lot of unneeded reconciliations/polls),
- setting a sensible period for the reconciliations/polls, and
- the fact that the controller developer must write the extra logic that does the polling (for the second approach).
Watch Notifications on Real State
If the real state that backs an API object has built-in support for streaming notifications about its changes, you can include, in the same binary as the controller, logic that listens for such notifications and triggers a reconciliation of the relevant API object whenever one is received. A real example from our own work: each API object represented a role inside a PostgreSQL database. Because PostgreSQL supports notifications (for example, via its LISTEN/NOTIFY facility), the controller can subscribe to notifications about role creation/update/deletion and trigger a new reconciliation of the relevant API object upon receiving one. This is equivalent to making the controller driven not just by Kubernetes API notifications, but by “real state notifications” as well, as shown in Figure 6.
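The wiring can be sketched as follows; the queue below is a hypothetical stand-in for a notification channel such as PostgreSQL's LISTEN/NOTIFY (a real implementation would hold a database connection and LISTEN on a named channel):

```python
import queue

# Hypothetical stand-in for a database notification channel.
channel = queue.Queue()

def db_notify(payload):
    """The database side: fired when a role changes out of band."""
    channel.put(payload)

reconciled = []  # records which API objects got re-reconciled

def reconcile(role_name):
    """Re-assert the desired definition of the role (details elided)."""
    reconciled.append(role_name)

def listen_once():
    """The controller side: turn each real-state notification into a
    reconciliation of the relevant API object."""
    while not channel.empty():
        reconcile(channel.get())

db_notify("role:app_reader")  # someone altered or dropped the role directly
listen_once()                 # the controller reacts without any API event
```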
Unfortunately, there are cases where it’s not possible because the real state doesn’t support change notifications.
If you’re in such a case, you can still synthesize notifications from changes to the real state via polling, at the cost of writing, maintaining and operating an additional component. You can write and deploy a “poller” that polls the real state and updates the status of the relevant API object whenever it finds a difference between real and desired state. Such a status update results in a notification for the API object, which the controller receives and handles by reconciling again, fixing the real state.
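A sketch of this split, with hypothetical in-memory stand-ins for the API object and the real state: the poller never fixes anything itself; it only writes its observation into the object’s status, and the resulting notification drives the ordinary control loop, which does the actual fixing.

```python
# Hypothetical API object with spec (desired) and status (observed) sections.
api_object = {"spec": {"users": ["alice"]}, "status": {"drifted": False}}
real_users = ["alice"]   # real state, living outside the Kubernetes API
notifications = []       # stand-in for watch notifications on the API object

def update_status(obj, drifted):
    """Status writes go through the API server, so they emit a notification."""
    if obj["status"]["drifted"] != drifted:
        obj["status"]["drifted"] = drifted
        notifications.append("api-object-updated")

def poller_tick():
    """The extra component: compare real and desired state, report drift."""
    update_status(api_object, real_users != api_object["spec"]["users"])

def controller_drain():
    """The ordinary control loop, driven purely by API notifications."""
    while notifications:
        notifications.pop()
        real_users[:] = api_object["spec"]["users"]  # reconcile real state
        update_status(api_object, False)             # clear the drift flag

real_users.remove("alice")  # out-of-band drift
poller_tick()               # poller notices it and updates the status
controller_drain()          # controller reconciles as usual
```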
Conclusion

We described a problem that we often see neglected when writing custom Kubernetes controllers: the real state that backs an API object can drift away from the desired state, and the controller must be notified so it can reconcile the two states even in this circumstance. We then sketched some ideas that can serve as starting points for solving the problem. We hope anyone involved in writing custom controllers can benefit from awareness of the problem and the ideas on how to overcome it.