It’s the type of nightmare that leaves developers in a cold sweat. Imagine waking up to a message from your team that simply says, “We lost a cluster,” but it’s not a dream at all.
InfluxDB Cloud runs on Kubernetes, a cloud application orchestration platform. We use an automated Continuous Delivery (CD) system to deploy code and configuration changes to production. On a typical workday, the engineering team delivers between 5 and 15 different changes to production.
To deploy these code and configuration changes to Kubernetes clusters, the team uses a tool called ArgoCD. ArgoCD reads a YAML configuration file and uses the Kubernetes API to make the cluster consistent with the code specified in the YAML config.
ArgoCD uses custom resources in Kubernetes (called Applications and AppProjects) to manage the source infrastructure as code repositories. ArgoCD also manages the file paths for these repositories as well as the deployment destinations for specific Kubernetes clusters and namespaces.
Because we maintain multiple clusters, we also use ArgoCD to police itself and manage the definitions of all the different ArgoCD Applications and AppProjects. This is a common development approach, often referred to as the “app of apps” pattern.
We use a language called jsonnet to create a template of the YAML configuration. The CD system detects changes in the jsonnet, converts the jsonnet into YAML, and then Argo applies the changes. At the time of our incident, all resources for a single application were kept in a single YAML file.
The object names and directory structures follow naming conventions: (app name)-(cluster name) for object names, and env/(cluster name)/(app name)/yml for the location in the repository where the definition is kept. For example, app01 in cluster01 is defined as app01-cluster01 and its definition is kept under the path env/cluster01/app01/yml.
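The naming convention can be expressed as two small helpers. This is only an illustrative sketch: the function names are hypothetical and are not part of our tooling.

```python
def object_name(app: str, cluster: str) -> str:
    """Name of the ArgoCD object for an app deployed to a cluster."""
    return f"{app}-{cluster}"


def definition_path(app: str, cluster: str) -> str:
    """Repository path where the rendered definition is kept."""
    return f"env/{cluster}/{app}/yml"


print(object_name("app01", "cluster01"))      # app01-cluster01
print(definition_path("app01", "cluster01"))  # env/cluster01/app01/yml
```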
We perform a code review of our Infrastructure as Code, which includes inspecting the resulting YAML and ensuring that it will function as expected before applying the update.
The ordeal began with a single line of code in a configuration file. Someone on the team created a PR that added several new objects to the config file and to the rendered YAML file.
In this case, the added objects included a new ArgoCD Application and AppProject. Due to an error in automation, the names of the objects were wrong. They should have been named app02-cluster01, but instead were named app01-cluster01. The code review missed the difference between app01 and app02, so when rendered, both sets of resources ended up under the same name in a single YAML configuration file.
When we merged the PR with the misnamed objects, ArgoCD read the entire generated YAML file and applied all objects in the order they were listed in the file. When two objects share a name, the last one listed “wins” and gets applied, which is what happened. ArgoCD replaced the previous instance of app01 with the new one. The problem was that the instance of app01 that ArgoCD deleted was InfluxDB Cloud’s core workload.
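The “last object wins” behavior can be illustrated with a toy model. This is not how ArgoCD is implemented internally; it is a simplified sketch that assumes objects are identified by kind and name and applied in file order. The manifests and the `apply_in_order` helper are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Manifest = Dict[str, Any]


def apply_in_order(manifests: List[Manifest]) -> Dict[Tuple[str, str], Manifest]:
    """Apply manifests sequentially; a later object with the same
    (kind, name) identity replaces the earlier one."""
    cluster_state: Dict[Tuple[str, str], Manifest] = {}
    for manifest in manifests:
        identity = (manifest["kind"], manifest["metadata"]["name"])
        cluster_state[identity] = manifest
    return cluster_state


rendered = [
    # The existing core workload.
    {"kind": "Application", "metadata": {"name": "app01-cluster01"},
     "spec": {"source": {"path": "env/cluster01/app01/yml"}}},
    # The new object, which should have been named app02-cluster01.
    {"kind": "Application", "metadata": {"name": "app01-cluster01"},
     "spec": {"source": {"path": "env/cluster01/app02/yml"}}},
]

state = apply_in_order(rendered)
# Only one object survives, and it points at the new workload.
print(state[("Application", "app01-cluster01")]["spec"]["source"]["path"])
```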
Furthermore, the new object created an additional workload that we didn’t want to enable on that cluster. In short, when ArgoCD replaced the instance of app01, that process triggered an immediate deletion of an entire production environment.
Obviously, this was not good for our users. When production went down, all API endpoints, including all writes and reads, returned 404 errors. During the outage, no one was able to collect data, tasks failed to run, and external queries didn’t work.
Disaster Recovery — Planning and Initial Attempts
We immediately set to work to fix the issue, beginning by reviewing the code in the merged PR. The issue was difficult to spot because it involved an ArgoCD collision between a project and an application name.
Our first intuition was to revert the change to get things back to normal. Unfortunately, that’s not exactly how stateful applications work. We started the reversion process, but stopped almost immediately because reverting the change would cause ArgoCD to create a brand new instance of our application. This new instance wouldn’t have the metadata about our users, dashboards, and tasks that the original instance had. Critically, the new instance wouldn’t have the most important thing — our customers’ data.
At this point, it’s worth mentioning that we store all the data in an InfluxDB Cloud cluster in volumes that use a reclaimPolicy: Retain. This means that even if the Kubernetes resources we manage such as StatefulSet and/or PersistentVolumeClaim (PVC) are deleted, the underlying PersistentVolumes and the volumes in the cloud are not deleted.
We created our recovery plan with this critical detail in mind. We had to manually recreate all of the underlying Kubernetes objects, such as PVCs. Once the new objects were up and running, we needed to restore any missing data from backup systems and then have ArgoCD recreate the stateless parts of our application.
Disaster Recovery — Restoring State and Data
InfluxDB Cloud keeps state in a few components of the system that other microservices interact with, including:
- Etcd: Used for metadata, this exists on a dedicated cluster separate from the Kubernetes control plane.
- Kafka and Zookeeper: Used for Write-Ahead Logs (WALs).
- Storage engine: This includes PVCs and object store for persistence.
The team started by restoring etcd and our metadata. This was probably the most straightforward task in the recovery process because etcd stores a relatively small data set so we were able to get the etcd cluster up and running quickly. This was an easy win for us and allowed us to focus all our attention on the more involved recovery tasks, like Kafka and storage.
We identified and recreated any missing Kubernetes objects, which brought the volumes (specifically PersistentVolume objects) back online and put them in an available state. Once the issue with volumes was fixed, we recreated the StatefulSet, which ensures that all the pods run and the cluster stays in sync.
The next step was to restore Kafka, and to do that we also had to get Zookeeper, which keeps metadata for the Kafka cluster, into a healthy state. The Zookeeper volumes were also deleted in the incident. Fortunately, we use Velero to back up Zookeeper hourly, and Zookeeper’s data does not change often. We successfully restored the Zookeeper volumes from a recent backup, which was sufficient to get it up and running.
To restore Kafka, we had to create any missing objects related to the volumes and state of Kafka, then recreate the cluster’s StatefulSet one pod at a time. We decided to disable all the health and readiness checks to get the Kafka cluster into a healthy state. This was necessary because we had to create the pods in the StatefulSet one at a time, and Kafka does not become ready until the cluster leader is up. Temporarily disabling the checks allowed us to create all the necessary pods, including the cluster leader, so that the Kafka cluster reported as healthy.
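The reason readiness checks blocked the rollout can be shown with a small simulation. This is a toy model under stated assumptions, not Kubernetes or Kafka code: it assumes pods are created in ordinal order, that each pod must report ready before the next is created, and that the leader happens to be a later pod (the `leader` index here is hypothetical).

```python
from typing import List


def rollout(replicas: int, leader: int, wait_for_ready: bool) -> List[int]:
    """Return the ordinals of the pods that get created.

    Pods are created in ordinal order. When wait_for_ready is True, the next
    pod is created only after the previous one reports ready, and a pod
    reports ready only once the leader pod is running.
    """
    running: List[int] = []
    for ordinal in range(replicas):
        running.append(ordinal)
        ready = leader in running
        if wait_for_ready and not ready:
            break  # the rollout stalls: this pod never becomes ready
    return running


# With readiness gating, the rollout stalls before the leader is created.
print(rollout(replicas=3, leader=2, wait_for_ready=True))   # [0]
# With checks disabled, every pod is created, including the leader.
print(rollout(replicas=3, leader=2, wait_for_ready=False))  # [0, 1, 2]
```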
Because Kafka and etcd are independent of each other, we could have worked on restoring both in parallel. However, we wanted to be sure to have correct procedures in place, so we opted to restore them one at a time.
Once Kafka and etcd came back online, we could re-enable parts of InfluxDB Cloud to start accepting writes. Because we use Kafka as our Write-Ahead Log (WAL), even without storage functioning properly, we could accept writes to the system and add them to the WAL. InfluxDB Cloud would process these writes as soon as the other parts came back online.
As writes became available, we became worried that our instance would get overwhelmed with requests from Telegraf and other clients writing data that buffered while the clusters were down. To guard against this, we resized the components that handle write requests, increasing the number of replicas and increasing memory requests and limits. This helped us handle a temporary spike in writes and ingest all the data into Kafka.
To fix the storage components, we recreated all the storage pods. InfluxDB also backs up all time series data to an object store (e.g., AWS S3, Azure Blob Store, and Google Cloud Storage). As pods came up, they downloaded a copy of data from object storage and then indexed all the data to allow efficient reading. After that process was completed, each storage pod contacted Kafka and read any unprocessed data in WAL.
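The recovery sequence for a storage pod (load the object store copy, index it, then replay the WAL) can be sketched as follows. This is a simplified illustration, not the storage engine's code. In particular, tracking a `snapshot_offset` to decide which WAL entries are unprocessed is an assumption made for this example, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[str, float]      # (series key, value)
WalEntry = Tuple[int, Point]   # (offset, point)


def recover_storage_pod(
    snapshot: List[Point],
    snapshot_offset: int,
    wal: List[WalEntry],
) -> Dict[str, List[float]]:
    """Rebuild an in-memory index from a snapshot plus unprocessed WAL entries."""
    index: Dict[str, List[float]] = {}
    # Steps 1 and 2: load the copy from object storage and index it by series.
    for key, value in snapshot:
        index.setdefault(key, []).append(value)
    # Step 3: replay only the WAL entries written after the snapshot.
    for offset, (key, value) in wal:
        if offset > snapshot_offset:
            index.setdefault(key, []).append(value)
    return index


snapshot = [("cpu", 1.0), ("mem", 2.0)]
wal = [(0, ("cpu", 1.0)), (1, ("mem", 2.0)), (2, ("cpu", 3.0))]
recovered = recover_storage_pod(snapshot, snapshot_offset=1, wal=wal)
print(recovered)
```

Because writes accepted during the outage sit in the WAL at offsets beyond the snapshot, they are picked up in the replay step once the pod is running.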
Disaster Recovery — Final Phase
Once the process of creating the storage pods and indexing existing data was underway, the disaster recovery team was able to focus on fixing other parts of the system.
We changed some of the settings for the storage cluster, reducing the number of replicas for some services to allow the pieces coming back online to start faster. At this point, we re-enabled ArgoCD so it could create any Kubernetes objects still missing.
After the initial deployment and storage engine became fully functional, we could re-enable functionality for key processes, like querying data and viewing dashboards. While this process continued, we started to recreate the proper number of replicas for all resources, and re-enabled any remaining functionality.
Finally, with all the components deployed with the expected number of replicas and everything in a healthy and ready state, the team enabled scheduled tasks and did final QA checks to make sure that everything was running properly.
In total, just under six hours passed between the PR being merged and full functionality being restored.
What We Learned
After the incident, we performed a proper post-mortem to analyze what went well and what we could improve for future incidents.
On the positive side of things, we were able to recover the system without losing any data. Any tools that retry writing data to InfluxDB continued to do so throughout the outage and eventually, that data was written to the InfluxDB Cloud offering. For example, Telegraf, our open source collection agent, performs retries by default.
The most significant problem was that our monitoring and alerting systems did not detect this issue right away. That is why our initial response was to try to roll back the change, as opposed to planning and performing a thought-out recovery process. We also lacked a runbook for losing part of, or an entire, instance of InfluxDB Cloud.
As an outcome of this incident, InfluxData engineering created runbooks focused on restoring state. We now have detailed instructions on how to proceed if a similar situation occurs, i.e., if Kubernetes objects (such as Persistent Volume Claims) get deleted, but the data on the underlying disks and volumes are preserved. We also made sure that all volumes in all our environments are set to retain data, even if the PVC object gets deleted.
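A check that all volumes retain data could look like the sketch below. It assumes the PersistentVolume objects have already been loaded as dictionaries (for example, from the Kubernetes API) and inspects the `spec.persistentVolumeReclaimPolicy` field. The helper name and the sample volumes are hypothetical.

```python
from typing import Any, Dict, List


def volumes_not_retained(persistent_volumes: List[Dict[str, Any]]) -> List[str]:
    """Names of PersistentVolumes whose reclaim policy is not Retain.

    A missing policy is reported too, so it can be verified by hand.
    """
    offenders = []
    for pv in persistent_volumes:
        policy = pv.get("spec", {}).get("persistentVolumeReclaimPolicy")
        if policy != "Retain":
            offenders.append(pv["metadata"]["name"])
    return offenders


sample_volumes = [
    {"metadata": {"name": "pv-storage-0"},
     "spec": {"persistentVolumeReclaimPolicy": "Retain"}},
    {"metadata": {"name": "pv-kafka-0"},
     "spec": {"persistentVolumeReclaimPolicy": "Delete"}},
]
print(volumes_not_retained(sample_volumes))  # ['pv-kafka-0']
```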
We have also improved our process for handling public-facing incidents. We aim to have as few incidents as possible, but this improved process should help us with any future public-facing problem on our platform.
On the technical side, we realized our systems should have prevented the PR from being merged, and we took multiple steps to address this. We changed how InfluxDB stores generated YAML files, moving to a one-object-per-file approach, for example v1.Service-(namespace).etcd.yaml for an etcd Service. In the future, a similar PR would clearly show up as an overwrite of an existing object and would not be mistaken for the addition of a new one.
We also improved our tooling to detect duplicates when generating YAML files. The system now warns everyone of duplicates before a change is submitted for review. Because of how Kubernetes works, the detection logic looks at more than just filenames. For example, apiVersion includes both the group name and the version: objects with apiVersion networking.k8s.io/v1beta1 and networking.k8s.io/v1 that share the same namespace and name should be considered the same object, even though the apiVersion strings differ.
This incident was a valuable lesson in configuring our CD. ArgoCD allows adding specific annotations that prevent the deletion of certain resources. Adding a Prune=false annotation to all our stateful resources ensures ArgoCD leaves those resources intact in the event of misconfiguration issues. We also add the annotation to Namespace objects managed by ArgoCD. Otherwise, ArgoCD would leave the StatefulSet alone but might still delete the Namespace it is in, causing a cascading deletion of all the objects in that Namespace.
We also added the FailOnSharedResource=true option for ArgoCD Application objects. This makes ArgoCD fail before attempting to apply any changes to an object that is or was previously managed by another ArgoCD application. As a result, a similar naming error, or pointing ArgoCD at the wrong cluster or namespace, causes the sync to fail instead of changing existing objects.
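The duplicate check described above can be sketched as follows. This is a minimal illustration, not our actual tooling: it assumes manifests are already parsed into dictionaries and treats an object's identity as its API group, kind, namespace, and name, ignoring the version part of apiVersion. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Identity = Tuple[str, str, str, str]  # (API group, kind, namespace, name)


def identity(manifest: Dict[str, Any]) -> Identity:
    """Identity of an object, ignoring the version part of apiVersion."""
    api_version = manifest["apiVersion"]
    # "networking.k8s.io/v1" -> group "networking.k8s.io"; core "v1" -> group ""
    group = api_version.rsplit("/", 1)[0] if "/" in api_version else ""
    meta = manifest["metadata"]
    return (group, manifest["kind"], meta.get("namespace", ""), meta["name"])


def find_duplicates(manifests: List[Dict[str, Any]]) -> Dict[Identity, List[int]]:
    """Map each duplicated identity to the positions where it appears."""
    seen: Dict[Identity, List[int]] = defaultdict(list)
    for position, manifest in enumerate(manifests):
        seen[identity(manifest)].append(position)
    return {ident: pos for ident, pos in seen.items() if len(pos) > 1}


manifests = [
    {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress",
     "metadata": {"namespace": "prod", "name": "gateway"}},
    {"apiVersion": "v1", "kind": "Service",
     "metadata": {"namespace": "prod", "name": "etcd"}},
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
     "metadata": {"namespace": "prod", "name": "gateway"}},
]
duplicates = find_duplicates(manifests)
print(duplicates)
```

Here the two Ingress objects are reported as duplicates even though their apiVersion strings differ, while the Service is not.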
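Adding the annotation can be automated when manifests are generated. The sketch below assumes manifests are dictionaries and uses ArgoCD's `argocd.argoproj.io/sync-options` annotation, whose value is a comma-separated list of options. The set of protected kinds and the helper name are assumptions made for this example.

```python
import copy
from typing import Any, Dict

SYNC_OPTIONS = "argocd.argoproj.io/sync-options"
# Assumed list of kinds to protect; adjust to the stateful resources in use.
PROTECTED_KINDS = {"StatefulSet", "PersistentVolumeClaim", "Namespace"}


def protect_from_pruning(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the manifest with Prune=false set for protected kinds."""
    if manifest.get("kind") not in PROTECTED_KINDS:
        return manifest
    protected = copy.deepcopy(manifest)
    annotations = protected.setdefault("metadata", {}).setdefault("annotations", {})
    # Keep any sync options that are already present.
    options = [o.strip() for o in annotations.get(SYNC_OPTIONS, "").split(",") if o.strip()]
    if "Prune=false" not in options:
        options.append("Prune=false")
    annotations[SYNC_OPTIONS] = ",".join(options)
    return protected


namespace = {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": "prod"}}
print(protect_from_pruning(namespace)["metadata"]["annotations"])
```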
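The option is set in the `spec.syncPolicy.syncOptions` list of an ArgoCD Application. A minimal sketch of enforcing it on generated Application manifests, assuming they are dictionaries, might look like this. The helper name and the sample Application are hypothetical.

```python
import copy
from typing import Any, Dict

OPTION = "FailOnSharedResource=true"


def require_exclusive_ownership(application: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an Application with FailOnSharedResource=true enabled."""
    if application.get("kind") != "Application":
        raise ValueError("expected an ArgoCD Application manifest")
    updated = copy.deepcopy(application)
    sync_policy = updated.setdefault("spec", {}).setdefault("syncPolicy", {})
    options = sync_policy.setdefault("syncOptions", [])
    if OPTION not in options:
        options.append(OPTION)
    return updated


app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "app02-cluster01"},
    "spec": {"source": {"path": "env/cluster01/app02/yml"}},
}
print(require_exclusive_ownership(app)["spec"]["syncPolicy"]["syncOptions"])
```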
One Final Note
These are all changes we already wanted to make, and the incident spurred us to implement them to improve our automation and processes. Hopefully, this deep dive into our experience will help you put an effective disaster recovery plan in place.
KubeCon+CloudNativeCon and InfluxDB are sponsors of The New Stack.