Adrian Cockcroft on ‘Failover Theater’ and Achieving True Continuous Resilience

27 Oct 2020 3:00am, by

How do you failover without falling over? Uptime and reliability are at the core of chaos engineering, the art and science of rooting out your systems’ weaknesses. It’s all about increasing the certainty that your backups and your backup’s backups are going to work.

At this year’s virtual ChaosConf, Adrian Cockcroft, vice president of cloud architecture strategy at Amazon Web Services, talked about the dangers of “availability theater” and how to better ground your system’s reliability in reality. He started by asking whether audience members even have a backup data center and whether they have ever tested failing over to it.

“If you have a backup data center but you never failed over to it and are not confident to failover to it in a moment’s notice, you invested a lot of money for a façade of availability,” he said.

Interestingly, in recent years, known outages are more likely to be caused by IT and network problems than power issues. Cockcroft quoted the 1984 book “Normal Accidents” on complex systems having multilayered failures that are “unexpected, incomprehensible, uncontrollable and unavoidable.”

While, like natural disasters, these outages may be unavoidable, you can still do everything in your power to prepare for them. Today we will share Cockcroft’s advice for continuously tested resilience.

Know Your Margins. Know What Can Go Wrong

Cockcroft talked about how we build redundancy so that there are alternatives to fail over to, but the failover mechanism itself is sometimes more complex than the thing we are failing over to. Complexity often increases the likelihood of failure. And, as Jimmy Cliff once sang, the bigger they are, the harder they fall.

That last straw didn’t really break the camel’s back, and your systems aren’t just as strong as your weakest link. After all, loads of teams are dedicated to hunting for that weakest link. Cockcroft says to instead think of it as a cable or rope. As it goes from fraying to completely ripped, you can’t blame the failure just on those last few strands.

Cockcroft explained, “You are not really getting the big picture because what happened is you built resilient systems that have lots and lots of redundancy and you gradually use up that redundancy until the rope gets frayed until it actually breaks.”

“Everybody with the best intentions, locally optimizing at every step in the process, will gradually consume all of that margin until the system fails.” — Adrian Cockcroft, AWS

He says it’s more important to capture near misses and to establish and measure your safety margins before it all goes to hell in a handbasket.

To add to all this, we can’t always trust the sensors, which could be missing updates or have other systems coordination problems. You could have two users fixing the system in two different ways at the same time — and neither is paying that close attention. Updates and patches could be too infrequent.

Cockcroft says that both the software and aeronautical industries attempt to keep complex systems under control via human beings, automated systems and processes.

In the two crashes of the Boeing 737 Max 8, the plane was the controlled process and the automated controller was the control system, but the human controllers, namely the pilots, were not trained in what to do with the new model.

“The pilot’s model of the anti-stall automation was not in line with the actual automation and that was one of the reasons why the planes crashed,” Cockcroft said.

What the aeronautical industry has that the software industry lacks is a global logging and notification system that triggers when even the slightest thing goes wrong on a single plane or flight. Software needs a better combination of observability and control. The former lets you understand your complex systems well enough to detect a failure quickly, and the latter lets you manage your response to that inevitable failure.

STPA: Hazards Analysis Model

Cockcroft recommends System-Theoretic Process Analysis (STPA), developed by MIT’s Professor Nancy Leveson, which uses a functional control diagram of your system that reflects the constraints needed to maintain successful operation. As in the example above, it lets you visualize the connections between components and how they are affected by failures.

The STPA model is divided into three layers:

  • The business function
  • The control function that manages the business function
  • The human operators that watch over the control system

In this circumstance, the control system is there to manage the small disturbances, like blocking fraudulent requests, but there’s a limit to what that control plane automation can do. What happens if there’s a big enough disturbance to break the web service?

“Think of the control plane and there’s a limit to what it can do. If you go beyond that limit, you are out of control so if your control plane automation failed. Things that are out of scope — like the controller, and the customers just aren’t getting to the system,” Cockcroft explained.

Once you map it out with an STPA, you start to understand the hazards that could disrupt your successful application processing.

The hazards of both the sensor metrics (in the diagram above: bottom right) and the model (top right) include:

  • Missing updates
  • Zeroed out
  • Overflowed
  • Corrupted
  • Out of order
  • Updates are too rapid
  • Updates are too infrequent
  • Updates are delayed
  • Coordination problems
  • Degradation over time
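A few of the hazards listed above can be sketched as simple checks on a stream of sensor readings. This is only an illustration, not something shown in the talk: the Reading type, its fields, the find_hazards function and both thresholds are hypothetical, and it assumes each reading carries a sequence number plus send and receive timestamps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One update from a hypothetical sensor (all fields are assumptions)."""
    seq: int            # sequence number assigned by the sensor
    sent_at: float      # seconds, when the sensor emitted the reading
    received_at: float  # seconds, when the controller received it
    value: float


def find_hazards(readings: List[Reading],
                 max_gap: float = 10.0,
                 max_delay: float = 5.0) -> List[str]:
    """Return labels for hazards found in the stream.

    max_gap and max_delay are illustrative thresholds in seconds.
    """
    hazards = []
    # Checks that compare each reading with the one received before it.
    for prev, cur in zip(readings, readings[1:]):
        if cur.seq < prev.seq:
            hazards.append(f"out of order: seq {cur.seq} after {prev.seq}")
        elif cur.seq > prev.seq + 1:
            hazards.append(f"missing updates: seq {prev.seq + 1}..{cur.seq - 1}")
        if cur.sent_at - prev.sent_at > max_gap:
            hazards.append(f"too infrequent: gap before seq {cur.seq}")
    # Checks on each reading by itself.
    for r in readings:
        if r.received_at - r.sent_at > max_delay:
            hazards.append(f"delayed: seq {r.seq}")
        # A zero can be a legitimate value; here it is treated as suspicious.
        if r.value == 0.0:
            hazards.append(f"zeroed out: seq {r.seq}")
    return hazards
```

Checks like these only tell the controller that its picture of the system may be wrong. Deciding what to do about it is still up to the automation and the human operators.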

The solution to these issues is usually mitigation through relocation, such as moving work to another server. But then there’s much more complexity for the human controller to try to model.

Cockcroft says instead we should work to simplify the human model with symmetrical patterns.

“The more simple the pattern, the better. And that means you can get your head around what’s going on because you understand it more easily,” he said.

Then you use tooling to enforce that symmetry.

Avoid Floods of Errors Turning into Storm Surges

He argues that moving from data centers to the cloud allows for more of this consistent configuration and automation. As much as you can, work to not introduce something into your model that breaks this symmetry. On the other hand, if something looks like a square, don’t try to smooth it into a circle.

“If something is different, try not to paper it over and make it look the same because it’s going to behave in a different way. And then test to those assumptions. This is the chaos engineering resilience test — testing assumptions where things are the same and different.” — Adrian Cockcroft, AWS

For AWS availability he advocates the Rule of Three. This kind of symmetry means making sure the same data and the same services exist across three zones, each with its own independent failure modes. Assume there’s a natural disaster cutting power to one zone. The system has to detect what’s going on and notify the controllers. You should be able to work with a zone offline with no visible downtime.
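The capacity arithmetic behind running with one zone offline can be sketched in a few lines. This is a simplified model rather than anything from the talk: it assumes identical zones and that the surviving zones absorb the failed zone's traffic evenly, and both function names are hypothetical.

```python
def utilization_after_zone_loss(zones: int, utilization: float,
                                failed: int = 1) -> float:
    """Per-zone utilization once `failed` zones drop out.

    Assumes identical zones and an even spread of the displaced load.
    """
    survivors = zones - failed
    if survivors <= 0:
        raise ValueError("no zones left to serve traffic")
    return utilization * zones / survivors


def max_safe_utilization(zones: int, failed: int = 1) -> float:
    """Highest steady-state per-zone utilization that still fits after failure."""
    if failed >= zones:
        raise ValueError("no zones left to serve traffic")
    return (zones - failed) / zones
```

Under these assumptions, three zones can each run at up to two-thirds of capacity and still absorb the loss of one zone, while two zones would each have to stay at or below half. Three zones running at 80% would be pushed past 100% after a failure, which is the fall-over-after-failover case.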

The human controller needs visibility into everything, but often there is a “flood of errors” and it’s challenging to figure out what is going on, whether a zone is down or there’s a failure with a particular sensor.

Among the three kinds of controllers and even among different types of the same controllers, there can be disagreements. This is when you run the risk of a retry storm.

“Then as the human control action, they shouldn’t really need to do anything. They are confused, working separately, try to solve different problems, tools are misconfigured — all kinds of things. And we’ve also got this flood of work coming in but the controllers are disagreeing and then the zone fails and the flood of requests causes everything else to fail. So this is the failover and fall-over-after-failure problem. So you can see that it’s pretty easy for the whole system to go down and break everything,” Cockcroft said.

He says one way to deal with this is to run game days. They help make sure your observability, monitoring and alert tooling is correlated and consistently synchronous.

You also want to prevent work amplification via retry storms by reducing your retries to zero except at entry and exit points. Similarly, he says to reduce timeouts to drop orphaned requests.
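To see why retries in the middle of a call chain amplify work, here is a minimal sketch. It assumes a simple model that is not from the talk: a chain of services in which every layer independently retries its downstream call, with every attempt failing in the worst case. The function names and the deadline check are hypothetical.

```python
import math
from typing import List


def worst_case_attempts(retries_by_layer: List[int]) -> int:
    """Worst-case number of calls reaching the deepest service for one request.

    retries_by_layer[i] is how many times layer i retries its downstream
    call, so each layer makes up to 1 + retries attempts.
    """
    return math.prod(1 + r for r in retries_by_layer)


def is_orphaned(deadline: float, now: float) -> bool:
    """True once the caller's deadline has passed.

    Nobody is waiting for the answer any more, so finishing the work
    only adds load to a system that is already struggling.
    """
    return now >= deadline
```

With two retries at each of three layers, a single failing request can turn into 27 calls at the bottom of the stack. Keeping the two retries only at the entry point and setting the inner layers to zero caps it at three. A short timeout, propagated as a deadline, lets each service drop requests whose callers have already given up.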

Finally, Cockcroft says you have to do chaos engineering first. He says you should make it a badge of honor to pass the test, so gamify it as much as possible. This is the way you work towards continuous resilience.

Amazon Web Services and Gremlin are sponsors of The New Stack.

Feature image via Chaos Conf.