How to Overcome Stuck EBS Volumes When Running Stateful Containers on AWS

I recently authored an article on The New Stack, “4 Reasons Not to Use AWS Elastic Block Storage for Stateful Container Operations,” that argued that while Amazon Web Services is a great place to run containers, you should avoid relying solely on AWS’ Elastic Block Storage (EBS) for persistent storage. I spoke from experience: Our customers running containerized apps in production on AWS have experienced these issues over and over again:
- Slow mount times and stuck volumes, which cause slow deployments.
- Slow failover, which means no high availability.
- Poor I/O, unless you are willing to spend a lot to improve it.
- Fragile volume orchestration via a storage connector.
In that article, I laid out the problems with EBS. But many readers were quick to ask: what is the solution? So today, I will focus on exactly that.
At Portworx, we are proud of the technology we’ve built. But the ultimate test of any technology isn’t pride, but how it stands up to the many and varied failures that come up in production. On Day One, almost everything works as documented. Day Two, Day Three, and Day 100 are a different story. Surviving 100 days in production, with the network partitions, server failures, and software crashes that invariably come with age, means that a technology is both performant when things are going well and resilient when they are not.
When it comes down to it, our argument against relying solely on Amazon EBS as the persistence layer for containers is framed in exactly these terms: based on our failure testing, EBS alone just isn’t up to the task for containers.
Before diving into the solution, let’s look in detail at a common container failure mode that Kubernetes and other orchestration frameworks are supposed to handle gracefully: node failure. Here is an example of what failover would look like if you used EBS as the persistence layer for your stateful containerized app, with EC2 instances running Postgres and MySQL databases on Kubernetes.
This diagram illustrates our setup before the node failure — our EBS drives are attached to Node A:
If the node running the database pods fails, we will not lose our data, because it is stored centrally on EBS. Instead, we just need to restart the pods elsewhere and re-attach the EBS volumes.
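For reference, this one-volume-per-container pattern is usually expressed in Kubernetes as a PersistentVolume backed by awsElasticBlockStore. The sketch below is purely illustrative; the volume name, size, and EBS volume ID are placeholders rather than values from this setup:

# Illustrative only: one EBS volume mapped to one container volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-ebs-volume             # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: "vol-0123456789abcdef0"   # placeholder EBS volume ID
    fsType: ext4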
The promise of this solution, which is automated via Kubernetes, is evident when the failure happens during the middle of the night. The on-call Ops person doesn’t have to be paged; instead, Kubernetes does the following:
- Notices the node has died.
- Re-schedules the Postgres and MySQL pods to another node.
- Detaches the EBS volumes from the old node.
- Attaches them to the new node.
- Starts the Postgres and MySQL containers on the new node using the EBS drive, keeping the data intact.
This process is illustrated by the following diagram:
This sounds like a dream come true, right?
But after talking to many customers who started out with this one-to-one EBS-to-container volume mapping, we’ve found that this theoretically simple and elegant process is, in practice, subject to many errors along the way.
The most common problem is a stuck EBS volume, which requires manual intervention, often a reboot of the EC2 instance, and that causes application downtime.
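For context, here is roughly what that manual intervention looks like with the AWS CLI when a volume is wedged in the “attaching” or “detaching” state; the volume and instance IDs below are placeholders:

# Check the attachment state of the volume (placeholder IDs throughout).
$ aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
    --query 'Volumes[0].Attachments[0].State'

# Force-detach the stuck volume; if even that fails,
# rebooting the instance is often the only remaining option.
$ aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
$ aws ec2 reboot-instances --instance-ids i-0123456789abcdef0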
Overcoming Stuck EBS Volumes with Portworx
So how do you overcome the problem of EBS volumes getting stuck attaching? One answer is to use Portworx.
Portworx takes a different approach from the one outlined above. Rather than providing an EBS volume for each stateful container and attaching, detaching, mounting, and unmounting it each time a container starts on a host, Portworx pools the underlying EBS drives into a single data layer and dynamically provisions virtual volumes on top of this storage on demand. This means containers get virtual slices of the underlying storage pool when needed, importantly avoiding the “container startup penalty” associated with attaching block devices.
In addition, because Portworx replicates data to multiple nodes, failover is as fast as rescheduling a pod; you don’t have to wait for EBS devices to mount and there is no risk of those mounts getting stuck. In fact, with Portworx replication in place, you don’t even need to use EBS at all. You can use the cheap, fast storage available on your EC2 instances themselves. Yes, this storage is ephemeral, but with Portworx providing persistence beyond the life of a host, it doesn’t matter.
Portworx Provides a Data Layer for Containers
Let’s dive deeper into what we just stated. When you run the Portworx container on each host in the cluster where you want to run Kubernetes pods, Portworx “fingerprints” the available storage on those hosts and combines it with the storage available on every other host in your cluster into a single, cluster-wide data layer. If you run Kubernetes pods on hosts that have an EBS device or two mounted to them, then Portworx provides virtual container volumes on top of those EBS devices.
If each host has only local EC2 instance storage, Portworx uses that disk to create virtual volumes. The key to all this is that Portworx breaks the one-to-one mapping of a block device to a container volume. Each EBS volume or local disk can back hundreds of virtual volumes, each provisioned instantly. This completely avoids the costly and error-prone volume attach/detach operation.
How does Portworx avoid attaching and detaching EBS volumes if it uses EBS as the underlying storage? With Portworx, each EBS drive is created, attached, mounted and formatted once, when it joins the storage pool. Typically, these processes are done at the same time you configure each host to run Kubernetes, not when a pod is actually deployed. You can reschedule your pod one, 10, 20, or 100 times, but you will never have to wait for EBS to mount again. Thousands of containers can be started using the same number of EBS drives as were initially configured as part of your cluster because Portworx decouples the underlying storage from the container volumes.
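As a rough sketch of that one-time setup (the size, volume type, zone, device name, and IDs below are assumptions for illustration), the EBS drive is created and attached when the host is provisioned, then handed to Portworx as part of the storage pool:

# One-time, at host-provisioning time (placeholder IDs and values).
$ aws ec2 create-volume --size 100 --volume-type gp2 --availability-zone us-east-1a
$ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf
# From here on, the device is simply listed as a Portworx storage device when Portworx
# starts on the host; it is never attached or detached again as pods come and go.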
With EBS, you are limited to at most 40 volumes per host, no matter how large the host is, whereas with Portworx, you can run many hundreds of containers per host, each with its own volume. Containers are supposed to be lightweight so we can densely pack them, but the one-to-one EBS mapping breaks that model. AWS’s documentation notes the following:
Important – Attaching more than 40 volumes to a Linux instance is supported on a best-effort basis only and is not guaranteed.
Here is an illustration of how the EBS drives are consumed when using the Portworx native Kubernetes Volume Driver:
As you can see, Portworx consumes the underlying storage but decouples the actual drives from the volumes it presents to containers. And because the data is replicated, the failover scenario discussed earlier becomes much simpler (and therefore less error-prone).
Using a Portworx volume, Kubernetes would do the following:
- Notice the node has died,
- Re-schedule the Postgres and MySQL containers to another node,
- Start the Postgres and MySQL containers on the new node using Portworx volume replicas that already exist on that host (with placement constraints in place, Portworx ensures Kubernetes schedules a container onto a host that already holds a replica of its volume; a sketch of such a constraint follows below).
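As a sketch of what such a placement constraint can look like, a nodeSelector on the pod is one generic Kubernetes mechanism; the label key and value below are hypothetical and depend entirely on how you label the nodes that hold Portworx replicas:

# Hypothetical placement constraint: run only on nodes labeled as holding a replica.
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  nodeSelector:
    px/volume-postgres-production: "true"   # hypothetical node label
  containers:
  - name: postgres
    image: postgres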
The following chart illustrates how Kubernetes handles failover when using a Portworx volume:
Because we are no longer enforcing a one-to-one relationship between EBS drives and containers, the following sequence is no longer needed:
- Detach block device from unresponsive old node,
- Attach the block device to the new node,
- Mount the block device to the new container.
Using Portworx with Kubernetes Persistent Volumes and Persistent Volume Claims
Now that we understand how to avoid the container startup penalty with EBS, let’s see how you would use Portworx via Kubernetes.
Imagine we have three nodes, each with two EBS volumes:
- 100GB – spinning disk – low IOPS – /dev/xvdf
- 50GB – SSD – provisioned IOPS – /dev/xvdg
When Portworx is installed on all three hosts, we will have a total storage pool available of 3 x 100GB + 3 x 50GB = 450GB across our three-node cluster.
This heterogeneous pool consists of two types of storage that can be used for applications with differing performance requirements.
Using pxctl (the command-line tool for controlling a Portworx storage cluster) or the CLI of our scheduler of choice, we can create volumes from the underlying storage offered by our EBS drives. Here is an example of creating a 10GB volume for our Postgres database with triple replication and high I/O priority (io_priority=high):
# --size 10G: a 10GB volume
# --repl 3: maintain 3 copies of the data
# --io_priority high: controls the class of storage used; EBS drives with provisioned IOPS can back this tier
# --fs ext4: the filesystem presented to the container; this can differ per container
$ pxctl volume create \
    --size 10G \
    --repl 3 \
    --io_priority high \
    --fs ext4 \
    postgres-production-volume
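Once created, the volume can be checked with pxctl to confirm its size, replication factor, and where its replicas live (output omitted here):

$ pxctl volume list
$ pxctl volume inspect postgres-production-volume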
We can then use the Kubernetes Volume Driver to create a Persistent Volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-production-volume
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  portworxVolume:
    volumeID: "postgres-production-volume"
    fsType: "ext4"
We then create a Persistent Volume Claim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc0001
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
And finally run our Postgres pod, which will use this claim:
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
  - name: postgres
    image: postgres
    volumeMounts:
    - name: postgres-prod
      # default data directory (PGDATA) of the official postgres image
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: postgres-prod
    persistentVolumeClaim:
      claimName: pvc0001
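To tie these pieces together, the three manifests above can be saved to files and applied in order; the file names here are just an assumption, and the claim should show as Bound before the pod starts:

# Hypothetical file names for the three manifests above.
$ kubectl apply -f postgres-pv.yaml -f postgres-pvc.yaml -f postgres-pod.yaml

# Verify the claim bound to the Portworx-backed volume and the pod is running.
$ kubectl get pvc pvc0001
$ kubectl get pod postgres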
Because our pod consumes its storage through a Persistent Volume Claim backed by a replicated Portworx volume, if the host the pod is running on dies, the pod will automatically be rescheduled to a new host without waiting for EBS volumes to be detached and reattached, reducing application downtime.
Conclusion
Amazon is a great place to run containers. However, relying on one EBS volume per container has a host of problems. These include:
- Slow mount times and stuck volumes, which mean slow deployments,
- Slow failover, which means no high availability,
- Poor I/O, unless you want to spend a lot of money,
- Fragile volume orchestration via a storage connector.
To get the most out of AWS as your container infrastructure without suffering from slow attach times or stuck volumes, follow these best practices to avoid the “attaching” penalty that comes with EBS:
- Do not use one EBS volume per Docker container
- Instead, mount a single EBS volume per EC2 instance
- Carve up that volume into multiple virtual volumes
- Instantly mount these volumes to your containers
We hope you will give Portworx a try and find out for yourself how it makes running containers on AWS easier. Next time you are looking for Kubernetes storage or just a persistent storage solution for containers, you can try Portworx for free forever. We’d like to know what you think.