GitOps for Databases on Kubernetes
As applications evolve, so do their database schemas. The practice of automating the deployment of database schema changes, known as database migrations, has evolved hand in hand with modern DevOps principles.
As part of this evolution, hundreds of “migration tools” have been created to help developers manage their database migrations. These tools range from ORM and language-specific tools like Alembic for Python to language-agnostic tools like Flyway and Liquibase.
Migrations on Kubernetes: The Current State
When Kubernetes came along and teams started to containerize their applications, the knee-jerk reaction was to wrap legacy migration tools in a container and run them as part of the application deployment process.
As often happens when we project old tools onto a new platform, the result is a set of shortcomings that need to be addressed. Let’s review and discuss some of these common practices.
Running Migrations in-app
The simplest approach is to invoke migrations during application startup. This doesn’t require any special Kubernetes capabilities. All we need to do is make sure that our migration tool, migration files, and database credentials are available inside our application container, and then change our boot logic to first run migrations and, only if they succeed, start the application.
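The pattern can be sketched as a Deployment whose container command chains the migration step before the application process. This is a minimal, hypothetical example assuming Flyway and the app binary are bundled into the same image; the image name, secret name, and binary path are illustrative, not prescriptive.

```yaml
# Hypothetical Deployment snippet: the image bundles both the application
# binary and the Flyway CLI, and DB credentials are injected at runtime.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.2.3  # hypothetical image
          # Boot logic: run migrations first; start the app only on success.
          command: ["/bin/sh", "-c", "flyway migrate && exec /app/server"]
          env:
            - name: FLYWAY_URL  # Flyway reads its config from env vars
              valueFrom:
                secretKeyRef:
                  name: db-credentials  # hypothetical Secret
                  key: url
```

Note that every one of the three replicas carries the tool and the credentials, and every one of them attempts the migration on startup — which is exactly the problem discussed below.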
This is considered an antipattern for several reasons. First, from a security perspective, it is best to reduce the attack surface of your runtime environment to include nothing that isn’t strictly required at runtime. With this pattern, both the migration tool and the elevated database credentials required to run DDL statements linger in the runtime environment for an attacker to exploit.
Second, assuming your application runs multiple replicas for redundancy and availability, running migrations as part of application startup forces replicas to boot sequentially instead of in parallel. It is very dangerous to apply the same database changes from multiple places at once, which is why virtually all tools acquire some lock or synchronization mechanism (or require the user to provide one). In practice, this means a new Pod cannot start up until it has acquired the lock and excluded all the others.
If you only have a couple of replicas you might not feel the difference, but consider what happens when hundreds of them race against each other (with the required retries, backoffs, etc.) to start up.
Running Migrations as an init-container
A slight improvement on this technique is to use init containers. Kubernetes makes it possible to define an “init container,” a container that runs before the main container in a PodSpec. Using this approach, teams can bring in standalone tools (such as Liquibase or Flyway) and run them before the application boots.
In addition, the migrations themselves (SQL files) for the schema revisions must somehow be made available to this container either by building a custom image or mounting them from some external source.
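A sketch of this pattern, assuming the official Flyway image and migration files mounted from a ConfigMap (the ConfigMap, Secret, and image names are hypothetical; the files could equally be baked into a custom image):

```yaml
# Hypothetical PodSpec fragment: a Flyway init container applies migrations
# before the application container is allowed to start.
spec:
  initContainers:
    - name: migrate
      image: flyway/flyway:10          # official Flyway image
      args: ["migrate"]
      env:
        - name: FLYWAY_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials     # hypothetical Secret
              key: url
      volumeMounts:
        - name: migrations
          mountPath: /flyway/sql       # Flyway's default SQL location
  containers:
    - name: myapp
      image: registry.example.com/myapp:1.2.3  # hypothetical app image
  volumes:
    - name: migrations
      configMap:
        name: myapp-migrations         # hypothetical ConfigMap of V1__*.sql files
```

The migration tool and credentials now live only in the short-lived init container, not in the long-running application container — but each replica still runs its own init container.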
This approach is better than running migrations in-app because it removes the migration tool and credentials from the runtime environment but suffers from the same synchronization issues that we demonstrated with in-app migrations.
In addition, consider what happens when migrations fail. Migrations can fail for any number of reasons, ranging from invalid SQL to constraint violations to shaky network connectivity. When your migrations are coupled to your application runtime, any failure in the migration step leaves you with hordes of Pods in a crash loop, which can mean reduced availability or even downtime for your applications.
Running Migrations as Kubernetes Jobs
Kubernetes allows you to execute programs using the “Jobs” API. Similar to using init containers, teams would use a Job that wraps the migration tool and somehow mounts the migration files to execute before the application boots.
The advantage of this approach is that by using Jobs it is possible to make sure that migrations run as a discrete step before the new application Pods start to roll out. Teams commonly use Helm pre-upgrade Hooks or ArgoCD pre-sync hooks to implement this technique.
When combined, the result is migrations that run only once, avoiding the messy “race to migrate” exhibited by init containers, and that are isolated from the runtime environment, reducing the attack surface of applications, as discussed above.
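As an illustration, here is a hypothetical Job annotated as a Helm pre-upgrade hook, so it runs to completion before the new application Pods roll out. The image, secret, and release names are assumptions, not taken from any particular chart:

```yaml
# Hypothetical Helm-templated Job: the pre-install/pre-upgrade hook makes
# Helm run it (and wait for it) before deploying the application manifests.
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 0            # fail fast instead of brute-force retries
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: flyway/flyway:10    # official Flyway image
          args: ["migrate"]
          env:
            - name: FLYWAY_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials  # hypothetical Secret
                  key: url
```

Setting `backoffLimit: 0` is a deliberate choice here: as discussed below, blindly retrying a partially failed migration is rarely safe.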
GitOps Principles and Migrations
“We can wrap existing schema management solutions into containers, and run them in Kubernetes as Jobs. But that is SILLY. That is not how we work in Kubernetes.” – Viktor Farcic, DevOps Toolkit
All in all, running migrations as Jobs using ArgoCD or Helm hooks is an okay solution. But examined through the lens of modern GitOps principles, more issues are revealed.
In this context, let’s consider how the migration techniques that we described map to two commonly accepted GitOps principles:
A system managed by GitOps must have its desired state expressed declaratively.
Software agents continuously observe actual system state and attempt to apply the desired state.
Declarative — Virtually all migration tools in use in the industry today take an imperative, versioned approach. The desired state of the database is never described directly; it can only be deduced by applying all migration scripts in sequence. This means these tools cannot handle unforeseen or manual changes to the target environment in the way that GitOps is supposed to.
Continuously Reconciled — Kubernetes Jobs have a very simplistic way of dealing with failure: brute-force retries. If a migration fails, the Job Pod will crash and Kubernetes will try to rerun it (with some back-off strategy). This might work, but more often than not, migration tools are not designed to deal with partial failures, and retrying becomes a futile endeavor.
The Operator Pattern
If running migrations as Jobs is an ill-equipped strategy to satisfy GitOps principles, what is the missing piece?
Kubernetes is an excellent solution for managing stateless resources. However, for many stateful resources, such as databases, reconciling the desired state of a database with its actual state can be a complex task that requires specific domain knowledge. Kubernetes Operators were introduced to the Kubernetes ecosystem to help users manage complex stateful resources by codifying this domain knowledge into a Kubernetes controller.
At a high level, Operators work by introducing new CRDs (custom resource definitions) that extend the Kubernetes API to describe new kinds of resources, and by providing a controller: a specialized piece of software that runs in the cluster and is responsible for managing these resources declaratively using a reconciliation loop.
What if we could use a proper Kubernetes Operator to manage the database schema of our applications?
The Atlas Operator
The Atlas Kubernetes Operator is a Kubernetes controller that uses Atlas to manage your database schema. It allows you to define the desired schema and apply it to your database using the Kubernetes API.
The Atlas Operator supports a fully declarative flow, in which the desired state of the database is defined by the user and the operator is responsible for reconciling the desired state with the actual state of the database (planning and executing CREATE, ALTER, and DROP statements).
In addition, a more classic versioned workflow is supported as well, in which the desired version of the database is provided to the operator, and it acts to reconcile the actual state of the database with that version.
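In the declarative flow, the desired schema is expressed as a custom resource. The sketch below is based on the operator's `AtlasSchema` CRD; the resource, secret, and table names are illustrative, and field details may vary between operator versions:

```yaml
# Hypothetical AtlasSchema resource: the desired schema is stated directly,
# and the operator's reconciliation loop plans and executes the DDL needed
# to bring the target database to this state.
apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasSchema
metadata:
  name: myapp-schema
spec:
  urlFrom:
    secretKeyRef:
      name: db-credentials   # hypothetical Secret holding the DB URL
      key: url
  schema:
    sql: |
      create table users (
        id int not null,
        name varchar(255),
        primary key (id)
      );
```

Because this is a regular Kubernetes resource, it can be committed to Git, applied by ArgoCD or Flux like any other manifest, and inspected with `kubectl` — its status reflects the outcome of the latest reconciliation rather than the exit code of a one-shot Job.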
Using a Kubernetes Operator to manage our database has many advantages:
- It makes schema management a declarative process, which satisfies GitOps principles and, more importantly, is much simpler for the end user: they only need to define what they want and can think less about the how.
- It is continuously reconciled. Jobs’ robustness is limited to very basic retry mechanisms, as we’ve shown, but operators with a long-running reconciliation loop have far more means and opportunities to make progress towards the desired state of our application.
- It is semantically richer. Jobs are a very opaque way of managing resources: their spec mostly deals with how they run rather than the resource they represent, and their exposed status carries no meaningful information about that resource either. CRDs, on the other hand, can be managed and manipulated using standard Kubernetes tooling, and their status can be consumed programmatically to build higher-order workflows.
In this article, we have shown some of the existing practices for managing database schemas within Kubernetes applications and discussed their shortcomings. Finally, we demonstrated how the operator pattern can be used to satisfy GitOps principles and take database management forward.