Different Approaches for Building Stateful Kubernetes Applications

Kubernetes is one of the fastest-growing infrastructure projects in the history of computing. In a short span of five years, it has matured into the foundation of modern infrastructure. From managed Container as a Service (CaaS) in the public cloud to an enterprise Platform as a Service (PaaS) in the data center to the edge, Kubernetes is becoming ubiquitous.
During the early days of Kubernetes, it was considered primarily a platform for running web-scale, stateless services. Stateful services such as databases and analytics workloads were run either in virtual machines or as cloud-based managed services. But as Kubernetes became the preferred infrastructure layer, the ecosystem worked to make stateful applications first-class citizens in the Kubernetes universe.
There are multiple techniques for running stateful applications in Kubernetes, each with its own merits and demerits.
This article attempts to highlight the key approaches to running stateful applications in Kubernetes, the choices available, and the kind of workloads aligned with each approach. I assume that readers are familiar with the key building blocks of Kubernetes storage infrastructure, such as Persistent Volumes, Persistent Volume Claims, and Storage Classes.
Shared Storage for the Cluster
The first approach is integrating the Kubernetes cluster with traditional storage infrastructure exposed via Samba, NFS, or GlusterFS. This approach can be easily extended to cloud-based shared file systems such as Amazon EFS, Azure Files, and Google Cloud Filestore.
In this architecture, the storage layer is completely decoupled from the compute layer managed by Kubernetes. There are two ways to consume shared storage in Kubernetes Pods:
1) Native Provisioning: Luckily, most shared file systems either have volume plugins built into the upstream Kubernetes distribution or have a Container Storage Interface (CSI) driver. This enables cluster administrators to declaratively define Persistent Volumes (PV) with parameters specific to the shared file system or managed service.
2) Host-based Provisioning: In this approach, a boot script runs on each Node and mounts the shared storage. Each Node in the Kubernetes cluster thus has a consistent, well-known mount point that is exposed to the workload. A Persistent Volume can then point to the host directory through hostPath or a Local PV. Both provisioning styles are sketched below.
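The following manifests are a minimal sketch of the two styles; the NFS server address, export path, and host mount point (nfs.example.com, /exports/shared, /mnt/shared) are hypothetical placeholders.

```yaml
# Native provisioning: a PV backed by an NFS export.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-nfs-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany            # shared file systems let many Pods mount the volume
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.com    # hypothetical NFS server
    path: /exports/shared      # hypothetical export path
---
# Host-based provisioning: a PV pointing at the well-known
# mount point a boot script created on every Node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-hostpath-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /mnt/shared          # the consistent mount point on each Node
```

A workload then binds to either PV through an ordinary Persistent Volume Claim.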
Since the underlying storage manages durability and persistence, the workload is completely decoupled from it. This enables Pods to be scheduled on any Node without node affinity rules, which would otherwise pin a Pod to a chosen Node.
However, this approach is not ideal for stateful workloads that need high I/O throughput. Shared file systems are not designed to deliver the IOPS demanded by relational databases, NoSQL databases, and other write-intensive workloads.
Storage Choices: GlusterFS, Samba, NFS, Amazon EFS, Azure Files, Google Cloud Filestore
Typical Workloads: Content Management Systems, Machine Learning Training/Inference Jobs, and Digital Asset Management Systems.
StatefulSets
Kubernetes maintains the desired state of the configuration through controllers. Deployment, ReplicaSet, DaemonSet, and StatefulSet are some of the commonly used controllers.
The StatefulSet is a special type of controller that makes it easy to run clustered workloads in Kubernetes. A clustered workload typically has one or more masters and multiple slaves. Most databases are designed to run in a clustered mode to deliver high availability and fault tolerance.
A stateful clustered workload continuously replicates data between the masters and slaves. For this, the cluster infrastructure expects the participating entities (masters and slaves) to have consistent, well-known endpoints through which they can reliably synchronize state. But Pods in Kubernetes are designed to be ephemeral, and they are not guaranteed to keep the same name and IP address.
The other requirement of a stateful clustered workload is a durable storage backend that is fault-tolerant and capable of delivering the IOPS the workload demands.
To make it easy to run stateful clustered workloads in Kubernetes, StatefulSets were introduced. The Pods that belong to a StatefulSet are guaranteed to have stable, unique identifiers. They follow a predictable naming convention and also support ordered, graceful deployment and scaling.
Each Pod participating in a StatefulSet has a corresponding Persistent Volume Claim (PVC) that follows a similar naming convention. When a Pod gets terminated and is rescheduled on a different Node, the Kubernetes controller ensures that the Pod is associated with the same PVC, which guarantees that the state is intact.
Since each Pod in a StatefulSet gets a dedicated PVC and PV, there is no hard requirement to use shared storage. It is, however, expected that the StatefulSet is backed by a fast, reliable, durable storage layer, such as an SSD-based block storage device. Once writes are fully committed to disk, regular backups and snapshots can be taken from the block storage devices.
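As a minimal sketch of how these pieces fit together, the manifest below runs a hypothetical three-replica database: a headless Service provides the stable network identities, and the volumeClaimTemplates section gives each Pod its own PVC. The image, port, storage class name, and sizes are placeholder assumptions.

```yaml
# Headless Service: gives each Pod a stable DNS name
# (db-0.db, db-1.db, db-2.db within the namespace).
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None
  selector:
    app: db
  ports:
    - port: 5432               # placeholder port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:12   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # One PVC per Pod (data-db-0, data-db-1, data-db-2); a rescheduled
  # Pod reattaches to the PVC that matches its ordinal.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # assumed SSD-backed StorageClass
        resources:
          requests:
            storage: 20Gi
```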
Storage Choices: SSDs, Block Storage Devices such as Amazon EBS, Azure Disks, GCE PD
Typical Workloads: Apache ZooKeeper, Apache Kafka, Percona Server for MySQL, PostgreSQL Automatic Failover, and JupyterHub
Cloud Native Storage
The rise of Kubernetes created new market segments aligned with cloud native computing initiatives. Since storage is one of the key building blocks of cloud native infrastructure, a cloud native storage market segment has rapidly evolved in the recent past.
Cloud native storage brings traditional storage primitives and workflows to Kubernetes. Like other services, it is abstracted from the underlying hardware and operating systems. From provisioning to decommissioning, the workflow follows the lifecycle of a typical Kubernetes resource. Cloud native storage is application-centric, which means it understands the context of the workloads rather than acting as an independent layer outside of the cluster. Like other resources, it can expand and shrink based on workload conditions and characteristics. It can pool the individual disks attached to each Node and expose them as a single, unified logical volume to Kubernetes Pods.
From installing the storage cluster to resizing volumes, cloud native storage empowers Kubernetes administrators to use familiar YAML artifacts managed by the powerful kubectl CLI. Cloud native storage comes with dynamic provisioning, support for multiple filesystems, snapshots, local and remote backups, dynamic volume resizing and more.
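To illustrate that workflow, here is a sketch of dynamic provisioning with a cloud native storage platform. The provisioner name csi.example-storage.io and its replicas parameter are hypothetical; substitute whatever your platform documents.

```yaml
# A StorageClass for a cloud native storage platform.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cns-replicated
provisioner: csi.example-storage.io   # hypothetical CSI driver
parameters:
  replicas: "3"               # hypothetical parameter: replicate data across Nodes
reclaimPolicy: Delete
allowVolumeExpansion: true    # allows the PVC below to be resized in place
---
# A PVC against the class; the platform provisions the volume dynamically.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: cns-replicated
  resources:
    requests:
      storage: 100Gi
```

Resizing then amounts to editing the storage request on the PVC with kubectl and letting the platform expand the underlying volume.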
The only expectation cloud native storage platforms have is the availability of raw storage within the cluster which can be aggregated and pooled into one logical volume. The raw storage could be in the form of Direct Attached Storage (DAS) for on-premises clusters and block storage for managed clusters running in the public cloud.
Cloud native storage is to containers what block storage is to virtual machines. Both are logical chunks of storage carved out from underlying physical storage. While block storage is attached to a VM, cloud native storage is available through a Persistent Volume consumed by a container.
Most cloud native storage platforms come with a custom scheduler to support the hyperconvergence of storage and compute. These custom schedulers work with the built-in Kubernetes scheduler to ensure that a Pod is always placed on the same Node that holds its data.
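For example, Portworx ships a storage-aware scheduler called STORK that a Pod opts into through the schedulerName field; the sketch below shows that pattern, with the image and PVC name as placeholders.

```yaml
# A Pod that delegates placement to a storage-aware scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: db-client
spec:
  schedulerName: stork        # the custom scheduler; other platforms use their own names
  containers:
    - name: app
      image: postgres:12      # placeholder image
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data   # PVC from the earlier sketch
```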
Storage Choices: NetApp Trident, Maya Data, Portworx, Reduxio, Red Hat OpenShift Container Storage, Robin Systems, Rook, StorageOS
Typical Workloads: Any workload that expects durability and persistence