How Can We Prevent the Next System Outage Due to a Software Glitch?

Early last month, on April 3, roughly half of all flights across Europe were delayed or canceled, affecting about half a million travelers. The cause? A software glitch that triggered a major failure of Eurocontrol's Enhanced Tactical Flow Management System, the system that coordinates flights across European airspace and safely maximizes its capacity. In describing the incident, Eurocontrol tweeted:
“There has been a failure of the Enhanced Tactical Flow Management System. Contingency procedures are being put in place which will have the effect of reducing the capacity of the European network by approximately 10 [percent]. Further information will be provided as soon as possible.”
Later, Eurocontrol provided the following explanation for the failure:
“The trigger event for the outage has been identified and measures have been put in place to ensure no reoccurrence. The trigger event was an incorrect link between the testing of a new software release and the live operations system; this led to the deletion of all current flight plans on the live system. We are confident that there was no outside interference.”
The Eurocontrol incident presents yet another recent example of a software glitch causing major disruption, losses and inconvenience.
Challenges for Complex Systems
Indeed, there are many challenges and risks associated with complex systems. The Eurocontrol outage is not unlike the Wall Street trading debacle caused by Knight Capital in 2012. In that case, an incomplete upgrade left older code active, triggering waves of erroneous buy/sell orders on Wall Street. The impact? Knight Capital lost $440 million that day.
The risk lies in the sheer amount of interconnectedness and loose integration between multiple systems. For large systems, this web of dependencies requires a level of detail and causal understanding beyond what humans can readily hold, yet exactly that knowledge is needed to deploy changes to these systems safely, predictably and repeatably.
To better understand the challenges IT professionals face in deploying to complex systems, we can refer to How Complex Systems Fail, a short treatise by Richard I. Cook, MD (a copy of which is hosted by MIT). The eighteen points Dr. Cook enumerates and elucidates should sound familiar to DevOps practitioners.
Root Cause of Failure?
Much could be said about the Eurocontrol outage and the steps that could have been taken to prevent it. Even with Eurocontrol's explanation, many details are still missing, so it is difficult to offer a definitive prescription with any accuracy.
Indeed, point seven of Cook’s treatise states that “post-accident attribution to a ‘root cause’ is fundamentally wrong.” This is because complex-system failures have multiple contributors: several smaller faults may jointly combine to produce a single failure. And since software is constantly being updated, any root-cause lessons learned might not even apply to the next outage, because a complex system changes with every release.
It’s better to cultivate a safety culture and to have a solid failure plan, even a manual one as Eurocontrol did, than it is to hunt for someone to blame so you can fire them. Ironically, when failures do occur, it makes little sense to fire the “people responsible.” This is because the best prescription for reducing future failures is experience with failure itself, especially when we’re talking about complex systems.
The Case for Continuous Delivery Automation
Again, many things could be said and speculated about this outage. However, it’s worth considering how continuous delivery automation can help minimize risk, reduce deployment cost, bring predictability and auditability to deployments and potentially shorten the response time to any outage.
In their book, Continuous Delivery, authors David Farley and Jez Humble postulate that the automation mechanics used for deployments need to be the same for EVERY environment, pre-production and production (prod) alike. This best practice allows the automation mechanics to be vetted and tested throughout the release pipeline, so that by the time changes are ready to be deployed to production, you have every confidence the production deployment will work, as proven and demonstrated dozens of times before.
Automation ensures a level of precision and predictability beyond what humans are capable of, especially against complex systems. Automation offers immense value in ensuring that the process used for “testing of a new software release” and for deploying to “the live operations system” is identical, because the same automation mechanics are used in both non-prod and prod environments.
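To make the idea concrete, here is a minimal sketch in Python of what “the same mechanics for every environment” can look like. The names (Environment, deploy, fetch_artifact) and the example values are invented for illustration and are not drawn from Eurocontrol’s or any vendor’s tooling; the point is simply that the routine is reused unchanged, and only the configuration data differs between test and production:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Per-environment configuration: only the data changes, never the mechanics."""
    name: str
    api_url: str
    flight_plan_db: str


def fetch_artifact(artifact: str, version: str) -> str:
    # Placeholder: a real pipeline would pull an immutable, versioned artifact
    # from a repository (never a mutable "latest" build).
    print(f"fetching {artifact}:{version}")
    return f"{artifact}-{version}.tar.gz"


def deploy(artifact: str, version: str, env: Environment) -> None:
    """One deployment routine, reused unchanged for every environment."""
    package = fetch_artifact(artifact, version)
    print(f"[{env.name}] applying {package} to {env.api_url}")
    print(f"[{env.name}] running smoke tests against {env.flight_plan_db}")


# The rehearsal and the real deployment differ only in the configuration passed in.
TEST = Environment("test", "https://test.example.internal", "test-flight-plans")
PROD = Environment("prod", "https://ops.example.internal", "live-flight-plans")

deploy("etfms-release", "2018.04.1", TEST)   # vetted here first
deploy("etfms-release", "2018.04.1", PROD)   # then replayed identically in production
```

Because the production run is the same code path exercised in every earlier environment, a wiring mistake like an “incorrect link” between test and live systems is far more likely to surface long before it can touch live data.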
Continuous delivery automation tools, like Release Automation solutions, provide layers of automation that go beyond the act of deployment itself. Capabilities include pulling all correctly versioned artifacts, direct integration with CI tooling and its output, modeling of environment data sets, dependency mapping, pipeline feedback loops and more. Such solutions can help orchestrate and respond to test automation across the many dependent components of complex systems. The result? Helping to ensure that the next software glitch doesn’t happen.
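As a rough illustration of what that orchestration layer adds beyond copying files, the sketch below (again in Python, with invented names; it does not represent any particular Release Automation product) models a release as a pinned set of component versions plus their dependencies, and refuses to promote it to the next pipeline stage unless every dependency is present and the previous stage has passed:

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """A release: pinned component versions plus the dependencies between them."""
    versions: dict       # component -> pinned version, e.g. {"flow-mgmt": "4.2.1"}
    dependencies: dict   # component -> list of components it requires


@dataclass
class Pipeline:
    stages: list                                    # ordered stage names
    results: dict = field(default_factory=dict)     # (stage, component) -> passed?

    def promote(self, release: Release, stage: str) -> None:
        """Deploy every component of the release to a stage, checking dependency
        mapping and feedback from the previous stage before touching anything."""
        idx = self.stages.index(stage)
        previous = self.stages[idx - 1] if idx > 0 else None
        for component, version in release.versions.items():
            # Dependency mapping: every required component must be in the release.
            missing = [d for d in release.dependencies.get(component, [])
                       if d not in release.versions]
            if missing:
                raise RuntimeError(f"{component} is missing dependencies: {missing}")
            # Feedback loop: the component must have passed the previous stage.
            if previous and not self.results.get((previous, component), False):
                raise RuntimeError(
                    f"{component} has not passed {previous}; refusing {stage}")
            print(f"deploying {component}:{version} to {stage}")
            self.results[(stage, component)] = True  # stand-in for real test feedback


release = Release(
    versions={"flow-mgmt": "4.2.1", "flight-plan-db": "9.0.3"},
    dependencies={"flow-mgmt": ["flight-plan-db"]},
)
pipeline = Pipeline(stages=["test", "prod"])
pipeline.promote(release, "test")   # rehearsed first
pipeline.promote(release, "prod")   # promoted only because "test" passed
```

Real tools carry far more state than this toy model, but the principle is the same: the pipeline, not a person, decides whether a change is safe to move forward.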
CA Technologies is a sponsor of The New Stack.
Feature image via Pixabay.