Every production system has issues. Every production system fails. This is why a team, and the organization as a whole, must go “through the process of change and creating a healthy and supportive culture of learning,” said Amir Shaked, senior vice president of research and development at web application security provider PerimeterX, where they have 300 fully-Dockerized microservices.
Speaking to a common theme across Chaos Carnival, where he gave his presentation, Shaked explained how PerimeterX learned to implement a wide communication channel to help prevent repeated incidents, because it helped bridge trust gaps. One of the most effective ways to do this is through debriefings.
With this in mind, Shaked’s team started looking at repeated issues: those constant but seemingly minor production failures where “minor risks become catastrophic as you scale,” he said.
As he looked to examine these repeated issues, things that logically a business would want to fix or prevent in the future, Shaked immediately felt pushback. The team had a strong fear of judgment: Why do you ask so many questions? Why don’t you trust us?
“If you have team members afraid or feeling that they are being judged or insecure in their work environments, they are going to underperform and as a team you are not going to be able to learn and adapt as you should,” Shaked said.
So, about three years ago, he set about establishing a new process for the team, focused on revamping how they analyze different kinds of failure.
Because, he said, “Assuming you have the right foundation of engineers, if you fix the process, anything can happen.”
Shaked shared PerimeterX’s debriefing process with the virtual audience of Chaos Carnival, and now The New Stack shares it with you today.
Debriefs Focus on Root Causes
An incident happens — a customer calls and complains. Usually, that’s how you find out about it.
Shaked said, “When they don’t have a resolution, they page the engineering team, usually waking them up. They find the problem and fix it but will resent the fact they had to wake up to fix it.”
But, he added, “If that’s the end, you will have similar issues again because you don’t have the root causes.”
“Humans make mistakes. This is why we need to fix the process and not try to fix the people.” — Amir Shaked, PerimeterX
The PerimeterX team pinpointed that they were missing that crucial last step — analyzing after the fact to learn lessons and stop recent history from repeating itself.
In their first new debrief, they realized that the incident in question was caused by code being deployed into production by mistake. An engineer was merging into the main branch. The code failed the test, but it was late, so the engineer decided to pause everything and look at it the next day.
Shaked said, “What he didn’t know was that the microservice that he was working on, a different addition was made by a DevOps engineer that automatically deployed into production — with an autoscale.”
He said they could have focused on why there was a merge in the first place, why the developer didn’t know about the autoscaling, or how microservices are deeply complex and don’t autoscale easily.
Instead, their new debriefing zoomed in on why there was a misunderstanding about how to treat the main branch.
Together, the team determined that the main branch equals production. That means, no matter what, any change involving the main branch is considered a drastic change.
Shaked’s team had to intentionally remove judgment from the debriefing process. He says that when you just assume that people are doing their jobs, and when you’re focusing on the process, you can take away the blame and get to the root cause.
Then, as a team matures, the team will take smaller incidents to learn from too. Within 24 to 72 hours after the resolution, PerimeterX has a debrief meeting. Then about two to three weeks after the debrief, they do a checkpoint meeting to make sure the immediate tasks were incorporated.
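The timing described above can be sketched in a few lines of Python. This is only an illustration: the 24-to-72-hour and two-to-three-week windows come from Shaked's talk, while the function names and the example dates are hypothetical and not taken from PerimeterX's tooling.

```python
from datetime import datetime, timedelta


def debrief_window(resolved_at):
    """Return (earliest, latest) times for the debrief: 24 to 72 hours after resolution."""
    return resolved_at + timedelta(hours=24), resolved_at + timedelta(hours=72)


def checkpoint_window(debrief_at):
    """Return (earliest, latest) times for the checkpoint: 2 to 3 weeks after the debrief."""
    return debrief_at + timedelta(weeks=2), debrief_at + timedelta(weeks=3)


def in_window(when, window):
    """True if `when` falls inside the (earliest, latest) window, inclusive."""
    earliest, latest = window
    return earliest <= when <= latest


# Made-up example dates, for illustration only.
resolved = datetime(2021, 3, 1, 4, 30)
debrief = datetime(2021, 3, 2, 15, 0)

print(in_window(debrief, debrief_window(resolved)))
print(checkpoint_window(debrief))
```

A helper like this could feed calendar invites or a reminder bot, so the debrief and its checkpoint are scheduled as soon as an incident is resolved.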
Conduct a Debrief, Not a Retro
A retrospective is the most sacred of agile rituals. A retro, as it’s usually called, is used by teams to reflect on their way of working and to continuously get better at what they do. PerimeterX probably did hold a retro to examine its processes for debriefs, but not to examine specific incidents.
A debrief, on the other hand, is a formulaic activity to examine any incident that may have a severe impact on your operation.
One thing retros and debriefs have in common is asking a lot of questions. For PerimeterX’s debriefing sessions, they ask the following:
- What happened? This is a detailed timeline of events, from the moment the issue started rolling into production through to analysis and resolution. As PagerDuty’s Julie Gunderson pointed out, using a simple chat tool like Slack during the incident helps timestamp events.
- What’s the impact? Shaked says you have to convey the cost impact, how many and which customers were affected, and complaints received. You need to get a full scope, as it’s vital to get everyone to understand why you are delving into the problem. “Understanding the bigger picture, the more you do it, they will focus on that and focus on the bigger impact. And the learning will propagate to have resolutions sooner,” he said.
- How is everything related? Follow-up and action items are necessary for a debrief to be full scope. Try to find patterns, as you learn more about your system and how it fails.
- Did we identify the issue in under a certain amount of time? PerimeterX sets a target of five minutes. You need a timeframe to establish consistency, but that timeframe will vary by team.
- How long until we fixed the problem? Again, this varies by team, from under an hour to within ten minutes to automatic fixes. The goal of chaos engineering is to study your system to both shore it up and to automate as many fixes as possible.
Next comes the discussion of what needs to be done in order to make sure all the above goals are met, followed by a plan of action to make the system even better.
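One way to keep those questions consistent from one debrief to the next is to record the answers in a fixed structure. The sketch below is a hypothetical illustration, not PerimeterX's actual template: the class, field and function names are invented, the five-minute detection target is the figure Shaked cited, the one-hour resolution target is just one of the example values mentioned above, and measuring both durations from the start of the incident is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Debrief:
    """Answers to the debrief questions for a single incident (hypothetical layout)."""
    started_at: datetime    # when the issue started rolling into production
    detected_at: datetime   # when the team identified the issue
    resolved_at: datetime   # when the problem was fixed
    impact: str             # cost, affected customers, complaints received
    related_incidents: list = field(default_factory=list)  # patterns across failures
    action_items: list = field(default_factory=list)       # follow-ups to check later

    def time_to_detect(self):
        return self.detected_at - self.started_at

    def time_to_resolve(self):
        # Assumption: measured from the start of the incident, not from detection.
        return self.resolved_at - self.started_at


def unmet_goals(debrief,
                detect_target=timedelta(minutes=5),   # figure cited in the talk
                resolve_target=timedelta(hours=1)):   # assumed example target
    """Return the names of the timing goals this incident missed."""
    missed = []
    if debrief.time_to_detect() > detect_target:
        missed.append("detection")
    if debrief.time_to_resolve() > resolve_target:
        missed.append("resolution")
    return missed


# Made-up incident, for illustration only.
incident = Debrief(
    started_at=datetime(2021, 3, 1, 10, 0),
    detected_at=datetime(2021, 3, 1, 10, 12),
    resolved_at=datetime(2021, 3, 1, 10, 40),
    impact="hypothetical: checkout errors for a subset of customers",
)
print(unmet_goals(incident))
```

Any goal that comes back as missed is a natural starting point for the discussion and the plan of action, and the record deliberately has no field for who caused the incident.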
The ‘Drastic’ Cultural Change Driven by Streamlined Debriefs
Shaked said these changes to debriefs led to a drastic cultural change over time, but that they had to learn from their mistakes along the way.
First and foremost, they uncovered a lack of trust in the then newly promoted Shaked, who was coming in to “install” that new process and culture.
Inevitably your team will start playing the blame game, which he says you have to nip in the bud as quickly as possible.
“When the focus is on the process and the system, it’s not about who caused the incident. It’s setting the ground to creating the learning opportunities and improvement.” — Amir Shaked, PerimeterX
“If you see it starting to happen, you need to interfere politely and calmly,” Shaked advised.
Keep your debrief narrowly focused on one incident — not broader themes like retrospectives — and focus on the what, not the who. And remember to go easy on the why questions.
He explained, “You need to ask why someone did something, but you don’t want to create self-doubt — you want to focus on the process not the behavior.”
They also realized a debrief is a moot ritual if you don’t include follow-up action items, which you then check back on.
But sometimes you need to communicate in the moment. That’s why they implemented a crisis mode process, a proverbial big red button, and clarified what it is and when to press it to make sure it wakes everyone up. Having everyone around the table during a big issue bridges knowledge gaps and leads to a faster solution.
Shaked said a good debrief all comes down to process consistency, so people know the questions they are going to be asked ahead of time, which helps keep everything more positive.
He said, “Keeping calm and making it clear there is a path forward is really important for a change environment, especially when there’s a very serious incident with a very high impact.”
Over the last three years, through the simple act of honed debriefing, PerimeterX has learned some valuable lessons about both their teams and their systems. At the top of that list: never try to fix the humans. Trust that you have a good team, but also understand that humans are going to make mistakes.
Chaos Carnival was organized by MayaData, a sponsor of The New Stack.