FAA Flight Cancellations: A Lesson in Application Resiliency

It was, by the Federal Aviation Administration’s own account, a human error that brought down the agency’s Notice to Air Missions (NOTAM) system and, in turn, led to the halt of 32,578 flights into, out of and within the United States last Wednesday morning.
An engineer uploaded a damaged database file. The FAA said personnel failed to follow procedures during routine scheduled maintenance. It was an “honest mistake that cost the country millions,” an FAA official told ABC News.
It’s called a cascading failure, said Courtney Nash, internet incident librarian for IT continuous verification company Verica. She sees it all the time in her work with the Verica Open Incident Database (VOID): A configuration change goes wrong and it triggers chaos in the whole system.
“In the VOID, I’ve got countless examples of configuration change gone wrong, you know — a configuration change then interacted with an AWS setting or something that an API didn’t expect and then it starts freaking out,” Nash said. “They refer to that even in the FAA [incident] as a cascading failure.”
Modernization Challenges
But with human error at the heart of the FAA incident, questions arise about how companies can ensure they’re not the next ones in the public hot seat due to a future “cascading failure.”
Such failures are a particular risk with legacy systems, Nash added.
NOTAM had previously fielded complaints after a near disaster in 2017, when pilots of an Air Canada flight either missed or did not recall that one of San Francisco’s two runways would be closed July 7. The detail, Reuters reported at the time, was buried on Page 8 of a 27-page briefing package.
The Air Canada plane came within seconds of colliding with four other aircraft when its pilots chose the wrong reference point and tried to land on a parallel taxiway instead, the news report stated.
As a result, NOTAM was recently modernized, which included the addition of a new API that allows developers to easily access verified FAA aviation data such as NOTAMs, Special Activity Airspace and Temporary Flight Restrictions in one of three formats: AIXM, GeoJSON and custom XML.
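To illustrate what consuming such an API might look like, here is a minimal Python sketch. The endpoint URL, parameter names and response structure below are illustrative assumptions for demonstration purposes, not the documented contract of the FAA’s actual API; the offline example simply shows how GeoJSON-style NOTAM data could be assembled into a query and summarized.

```python
# Hypothetical sketch of querying a NOTAM API for GeoJSON output.
# The URL, parameter names and payload shape are assumptions, not
# the FAA's documented interface.

NOTAM_API_URL = "https://example.faa.gov/notamapi/v1/notams"  # assumed endpoint


def build_notam_query(icao_location: str, response_format: str = "geoJson") -> dict:
    """Assemble query parameters for a NOTAM lookup (assumed parameter names)."""
    if response_format not in ("aixm", "geoJson", "xml"):
        raise ValueError(f"unsupported format: {response_format}")
    return {"icaoLocation": icao_location, "responseFormat": response_format}


def summarize_notams(geojson_payload: dict) -> list:
    """Pull human-readable NOTAM text out of a GeoJSON-style FeatureCollection."""
    return [
        feature["properties"]["notam"]["text"]
        for feature in geojson_payload.get("features", [])
    ]


# Offline demonstration with a made-up payload shaped like a FeatureCollection.
sample = {
    "features": [
        {"properties": {"notam": {"text": "RWY 10L/28R CLSD"}}}
    ]
}
print(build_notam_query("KSFO"))   # {'icaoLocation': 'KSFO', 'responseFormat': 'geoJson'}
print(summarize_notams(sample))    # ['RWY 10L/28R CLSD']
```

In a real client, the query dict would be sent as URL parameters with an HTTP library and the JSON response parsed before summarizing.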
Bureaucracy can hinder modernization efforts due to slow funding or other institutional challenges, Nash said.
“Old school software models are suffering under that, because they can’t update easily, right?” Nash told The New Stack. “It’s not a continuous development pipeline, where you can test things out and release things smartly and stuff. These old mainframe types of systems, you can’t do that. And we are seeing them creaking and breaking now.”
Incidents such as this and the Southwest cancellations over the December holidays, Nash added, have led to the public being more aware of software resiliency as a concern.
Resiliency as an Application ‘Must’
“The public is very aware of the importance of resilience of systems now,” Nash said. “So for companies, the resilience of our systems is going to become increasingly important. That helps the teams that focus on that and make the case for investing in that.
“If you didn’t invest in the systems and whatever you needed to do to make yours [systems] as resilient as possible, then people are going to start looking at you the same way they’re looking at Southwest Airlines.”
Often, issues that speak to resiliency have been seen as “cost centers,” Nash added.
“There’s not enough investment so that that trend is in tension with the recession, and the huge amounts of layoffs,” she said. “So there’s going to be a priority on resilience. And the question is, will those companies match that with investment in the people to do that? Are they going to just try to pile more on poor people who might not be able to support that? So I think the economic situation is going to make a lot of those things harder.”
A New Role: Incident Analyst
That’s a problem for companies, since Nash, based on her work with VOID data, forecasts that all companies will have incidents in the future. Smart companies, she said, are starting to create a dedicated role for investigating these events: the incident analyst.
Such an analyst should be tech-savvy, but also able to ask the right questions and have the tenacity to find out what triggered the incident, she said.
“These kinds of incidents are not just technical,” Nash said. “Everything we’ve just talked about is actually organizational. It’s people in organizations with computers and software. And if you ignore the first part of that equation, you’re missing a large part of the picture about why and how incidents can happen.”
That’s not to say organizations should be just looking for a way to blame someone — as some in the media seemed to want to do with the FAA engineer who erred, she added.
“We will always have incidents. There is no way to safeguard, to put in guardrails and all pause,” Nash said. “I see a lot of, ‘If you just had all the proper checks, and safety.’ OK, but who’s the person who’s going to sit down and come up with all those? They can’t — they actually can’t, because you can’t see the whole system.”