How Capital One Performs Chaos Engineering in Production
The Fortune 100 company with 100 million customers is 28 years old — young for a U.S. bank holding company, but built well before the age of cloud native fintech. Still, in 2020, Capital One became the first bank to run entirely on the public cloud, making it one of Amazon Web Services’ largest customers. This flew in the face of the finance industry’s reticence to shift to the cloud, which MYHSM CEO John Cragg pegs as grounded in uncertainty around regulations, cost and security.
So how does Capital One overcome that reservation to provide always-on banking and credit services solely via the public cloud? With a staunch investment in planned and unplanned chaos engineering and testing in production — and a heavy dose of open source. Read on to learn how controlled chaos reigns at Capital One and what they think they need to take it to the next level and become an elite DevOps organization.
How Capital One Tests in Production
Chaos engineering is a decade-old DevOps practice that brings a positive spin to tech’s obsession with moving fast and breaking things. Sort of an oxymoron, chaos engineering is actually controlled experimentation that aims to push services to their limits and to test the processes in place around software resiliency. Instead of being chaotic, it’s methodical, measurable and, when done well, backed by a rapid rollback plan. As TNS colleague Maria Korolov writes, companies usually kick off their chaos focusing on what they already know is broken, before moving on to addressing the overall business impact and resiliency goals.
While the principles of chaos engineering specifically focus on withstanding turbulent conditions in production, few organizations have reached the point where they are running their chaos in production. After all, nobody really wants to break customer service.
“Testing in production is considered a bad word: you didn’t test your code” ahead of release, said Pinos’ colleague Yar Savchenko, Capital One’s director of stability and site reliability engineering. In reality, he says, maintaining consistency between the lower QA environments and production is extremely difficult and expensive, “almost impossible and not worth the trouble.”
Instead, his team generates realistic, high loads — like that of a Friday payday — in production, using AWS Fault Injection Simulator, AWS Systems Manager and other chaos engineering tools, as well as tools they developed internally.
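The article doesn’t show Capital One’s tooling, but a payday-style load spike can be sketched in a few lines; `send_request` here is a purely hypothetical stand-in for a call to the service under test, and a real exercise would drive traffic through tools like AWS Fault Injection Simulator rather than a hand-rolled script:

```python
import concurrent.futures
import time

def send_request(i: int) -> int:
    """Hypothetical stand-in for one call to the service under test.
    A real exercise would issue an HTTP request to a production endpoint;
    this simply simulates a small network delay and a 200 response."""
    time.sleep(0.001)
    return 200

def generate_load(total_requests: int, concurrency: int) -> list:
    """Fire `total_requests` calls with `concurrency` parallel workers,
    mimicking a payday-style traffic spike."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_request, range(total_requests)))

statuses = generate_load(total_requests=200, concurrency=50)
print(sum(1 for s in statuses if s == 200), "successful responses")
```

In practice the interesting part is not the load generator itself but observing how autoscaling, load balancers and downstream dependencies behave while the spike is running.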
They “tool up to fight complexity and assume failure,” Pinos explained, including for monthly game days, as well as “unplanned” — or not previously warned — chaos experiments.
You Can’t Have Chaos without Infrastructure as Code
The Capital One team addresses certain buckets of failure at scale:
- Application layer failure – Internal tool Cloud Doctor helps teams understand the complexity of the environment by simulating an app layer failure in a small percentage of the production environment, so they can observe what happens and then work to make the application more resilient.
- Availability zone failure – Simulating — while everyone is still awake and working — what happens if a zone goes down: will traffic automatically reroute to a new zone, and will containers reconnect automatically?
- Regional failures – They need to ensure they have the capability to run out of a single region if one or more of the others go down.
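The availability zone drill above can be sketched as a toy failover check — the zone names and the `ZoneAwareClient` class are hypothetical illustrations, not Capital One’s tooling:

```python
import random

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

class ZoneAwareClient:
    """Toy client that reroutes to a healthy availability zone,
    mirroring the behavior a zone-failure game day is meant to verify."""

    def __init__(self, zones):
        self.healthy = set(zones)

    def fail_zone(self, zone: str) -> None:
        """Chaos step: take one zone out of service."""
        self.healthy.discard(zone)

    def route(self) -> str:
        """Route a request to any remaining healthy zone."""
        if not self.healthy:
            # All zones down models the regional-failure bucket
            raise RuntimeError("regional failure: no healthy zones left")
        return random.choice(sorted(self.healthy))

client = ZoneAwareClient(ZONES)
client.fail_zone("us-east-1a")   # simulate the zone going down
target = client.route()
print("rerouted to", target)
```

The real question a game day answers is whether this rerouting happens automatically in the platform, with no human in the loop.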
While this is a very technical internal process, it kicks off with a hypothesizing conversation — what would we do if X goes down? — creating disaster scenarios for specific architectural failures.
Pinos says the Capital One team wouldn’t have been able to achieve any of this without standardized deployment via infrastructure as code (IaC). “In order to understand the complexity of all the cloud, you have to invest in tooling,” he continued, and you’ve “got to get rid of the manual intervention through targeted exercises.”
Chaos Engineering for Lower Latency and Higher Capacity
Much of Capital One’s first round of chaos experiments looked to answer:
- How will a critical system perform under extreme load?
- What happens if one of the regions or data centers fails?
- Is an API gateway able to scale?
- Is the load balancer sized correctly?
These answers helped them proactively identify several scenarios in which latency could increase across multiple microservices.
“By conducting numerous chaos exercises, both planned and unplanned, we have identified latency. We can’t beat the speed of light, so the more data you have to push through and the farther apart your data centers are, the more your latency will increase,” Savchenko said at the same talk. The data will just bounce back and forth, he says, and then, “If something changes, like a component fails or your primary database moves from one center or region to another,” things time out and customers are negatively impacted.
The farther components are from each other, the more latency is introduced. “No customer is going to wait 30 seconds for your application to load,” Savchenko continued. Therefore, Capital One focused on moving components closer together and right-sizing them for the cloud.
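Savchenko’s speed-of-light point can be made concrete with back-of-the-envelope arithmetic. The figures below are illustrative assumptions, not Capital One’s numbers: light in fiber travels at roughly 200 km per millisecond, so physical distance alone puts a hard floor under latency before any queuing or processing:

```python
SPEED_IN_FIBER_KM_PER_MS = 200.0  # ~2/3 the vacuum speed of light

def min_round_trip_ms(distance_km: float, hops: int = 1) -> float:
    """Lower bound on round-trip time imposed by physics alone;
    real latency adds queuing, serialization and processing on top."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS * hops

# One round trip across data centers 4,000 km apart: 40 ms minimum.
# A chatty call path that crosses that boundary five times: 200 ms
# before the application does any work at all.
print(min_round_trip_ms(4000), "ms per round trip")
print(min_round_trip_ms(4000, hops=5), "ms for a chatty call path")
```

This is why "the data will just bounce back and forth" matters: each extra cross-region hop multiplies an irreducible cost.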
They also uncovered capacity issues. “The benefit of cloud is you can scale up unlimited, zero to any reasonable number. There is a cost, but sometimes you don’t size your cloud native resources correctly, so when the traffic shifts, you don’t have good computing power,” he continued.
Through chaos engineering, they found a number of cases where resources weren’t sized correctly for a spike in user access — “sometimes when we expect — when people get paychecks — but other unpredictable reasons too,” he explained.
And while “Nobody in IT likes processes,” Savchenko said they needed to build a process around what to do next, usually aiming to mitigate issues discovered within 30 days, through configuration changes, capacity expansion or rearchitecting. Typically, he said, sizing and latency defects can be quick to resolve.
The Risk of Testing in Production
As with any experiment run on actual users, there is an inherent risk to testing in production, but, Pinos argues, the value far outweighs the risks.
Unexpected impacts include:
- Increased latency
- Lack of adequate capacity
- Actual failures
- An actual incident occurring at the same time chaos is being performed in the same area of the product
“The key to managing risk is mitigation,” Pinos said. You need to sit testers down with both business and technical stakeholders from the start to create an agreed-upon playbook on the limits of your experimentation. If you reach that threshold, you abort the test. He says everyone has to agree ahead of time on rollback triggers and techniques that can be executed in under five minutes — “Anything to eliminate the lack of control.”
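The agreed-upon abort threshold can be sketched as a simple guard loop. The 5% error budget and the sample format here are hypothetical stand-ins for whatever the stakeholders actually agree on:

```python
ERROR_RATE_ABORT_THRESHOLD = 0.05  # assumed budget agreed with stakeholders

def should_abort(errors: int, requests: int) -> bool:
    """Abort the moment the agreed error budget is exceeded."""
    if requests == 0:
        return False
    return errors / requests > ERROR_RATE_ABORT_THRESHOLD

def run_experiment(samples):
    """Walk through (errors, requests) monitoring samples; trigger
    rollback on the first breach of the playbook threshold."""
    for errors, requests in samples:
        if should_abort(errors, requests):
            return "rollback"  # must complete in under five minutes
    return "completed"

# Error rate climbs 1% -> 3% -> 9%; the 9% sample trips the abort.
print(run_experiment([(1, 100), (3, 100), (9, 100)]))
```

The essential property is that the decision is mechanical: once the threshold is written down, nobody has to argue mid-incident about whether to pull the plug.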
Of course, in order to implement chaos engineering at scale, you have to know when there’s a problem, which is where real-time monitoring of all critical systems and transactions comes in: establish the steady state before the test, monitor during it, and confirm the return to steady state after.
“If you inject latency into the call path and you don’t verify that latency has gone away, you have a very big problem,” Pinos remarked.
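That before/during/after verification can be sketched as a steady-state comparison. The latency samples and 25% tolerance below are assumed values for illustration:

```python
from statistics import mean

def steady_state_ok(baseline_ms, current_ms, tolerance=1.25):
    """True if mean latency after the experiment is within `tolerance`
    times the pre-test baseline, i.e. injected latency has gone away."""
    return mean(current_ms) <= mean(baseline_ms) * tolerance

baseline = [110, 120, 115, 118]  # steady state measured before the test
after = [119, 121, 117, 116]     # measured again after rollback
print("steady state restored:", steady_state_ok(baseline, after))
```

If this check fails after rollback, the experiment has turned into the "very big problem" Pinos describes: real customers are still paying the injected cost.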
Monitoring is also necessary to measure the effect of the no-notice chaos events, which, so far, have had experienced site reliability engineers on standby to assist in case things go really awry. However, the rest of the affected teams aren’t aware. This is when Capital One breaks things on purpose to see if the automation, as well as the people and processes, responds as expected — how quickly do engineers jump on a bridge?
Steps Toward Next-Level Resiliency
“You should never stop growing. Status quo is the worst thing that you can achieve in the IT industry,” Savchenko said.
Capital One is currently performing chaos engineering in production for all web and mobile client-facing products. In 2023, they hope to expand the scope to all critical applications, including call centers and interactive voice response systems — which, he says, are usually considered separate but are certainly both part of the customer experience and ripe for improvement.
In addition, they are extending their chaos to third-party vendors. “Chaos will allow us to test and proactively identify gaps when utilizing those third-party vendors, including latency when they slow down,” Savchenko said. This also includes planned failures with key communication tools like Zoom, Slack and Splunk, making sure engineers can easily switch to a different tool.
They’re also experimenting with the generation and injection of fake HTTP error codes into downstream applications to see how they react to controlled errors.
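That kind of controlled error injection can be sketched as a wrapper around a downstream call — the handler, error rate and decorator shape here are hypothetical, not Capital One’s implementation:

```python
import random

def inject_errors(handler, error_rate: float, status: int = 503):
    """Wrap a downstream handler so a fraction of calls return a fake
    HTTP error status instead of the real response."""
    def wrapped(request):
        if random.random() < error_rate:
            return status          # injected fault
        return handler(request)    # normal path
    return wrapped

def downstream(request):
    """Stand-in for the real downstream application."""
    return 200

flaky = inject_errors(downstream, error_rate=0.2)
responses = [flaky(None) for _ in range(1000)]
print("injected", responses.count(503), "fake 503s out of 1000 calls")
```

Watching how callers react to those fake 503s — retries, fallbacks, circuit breakers or cascading timeouts — is the actual experiment.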
The Capital One engineering team is working toward being able to execute all sorts of tests on the highest-volume days with no advance notice. The team is also working hard to make any incident that occurs in production self-healing.
“Our end goal is to ensure that the game day exercises are unannounced… to use a single tool with one click to start the experiment, and if anything goes wrong, to roll back with a single click,” Savchenko said, as they strive to achieve a higher level of resiliency.