
Top 12 Best Practices for Better Incident Management Postmortems

2 Dec 2020 4:00am, by Steve Tidwell

StackPulse sponsored this post.

Poorly implemented postmortems for IT incidents can be painful for everyone involved; they cost money, and worse yet, they can fail to address the root cause of the problem. In this post, we will discuss some of the pitfalls of postmortems and introduce several best practices that can help smooth the postmortem process — including choosing the right personnel, creating a culture of accountability, and conducting “blameless” postmortems. In short, we will explain what you need to do to improve the postmortem process for everyone involved.

What Is a Postmortem?

Steve Tidwell
Steve has been working in the tech industry for over two decades, and has done everything from end-user support to scaling a global data ingestion and analysis platform to handle data analysis for some of the largest streaming events on the web. He has worked for a number of companies helping to improve their operations and automate their infrastructure.

According to Merriam-Webster, a postmortem is “an analysis or discussion of an event after it is over.” In the tech world, postmortem meetings are a key component of an overall incident management process. They are conducted after an undesirable outcome in order to determine what went wrong, why it went wrong, and how it can be avoided in the future.

Postmortems are not limited to the tech world. Many industries and organizations utilize this process to create a feedback loop that allows for continuous improvement. Regardless of the industry, though, a postmortem will almost always follow the same basic format:

  1. What was the intended outcome?
  2. What actually happened?
  3. Why did it happen?
  4. How can it be avoided in the future?

Retrospectives vs. Postmortems

Postmortems and Agile retrospectives share a similar intent, but there are a few key differences. Postmortems are normally held as soon as possible after an event or incident occurs. Retrospectives are held on a regular schedule, typically at the end of each sprint, as part of a wider Agile strategy that also includes sprint planning and a daily standup.

Although there are different ways to implement a retrospective, they usually look something like this:

  1. What went well during the project, sprint, or prior period?
  2. What didn’t go so well?
  3. What would we like to see in the future?

What to Avoid in a Postmortem Process

So can postmortems go wrong? Very easily, as it turns out. In an organization without proper accountability or a well-planned postmortem process, the most common problem is usually finger-pointing — or what is sometimes called “The Blame Game.”

Many people can probably relate to this scenario. A poorly moderated postmortem discussion would go something like this:

  1. Question: “What was the intended outcome?”
    Answer: “To successfully deploy new code and features to production.”
  2. Question: “What actually happened?”
    Answer: “The website went down during a regularly scheduled deployment.”
  3. Question: “Why did that happen?”
    Developers might answer: “QA signed off. They didn’t have a proper test strategy and let a bug slip into production.”
    QA might answer: “Ops didn’t configure the production environment correctly. If it weren’t for that, we would have caught this before it went out.”
    Ops might answer: “If the code had been written correctly, the application wouldn’t have crashed in the first place.”
  4. Question: “How can it be avoided in the future?”
    Developers might answer: “QA needs to do a better job in the future!”
    QA might answer: “Ops needs to do a better job in the future!”
    Ops might answer: “Developers need to do a better job in the future!”
    Management: “Sigh…”

The Blameless Postmortem

Google’s SRE Book has an excellent postmortem strategy in the chapter titled “Postmortem Culture: Learning from Failure.” It discusses why postmortems need to be conducted objectively (hint: people are hard-wired to point fingers) and why collaboration is a better approach (because most people want to learn from their mistakes and make things work better for everyone else too).

A practical implementation of a blameless postmortem would look something like this:

  1. Question: “What was the intended outcome?”
    Answer: “To successfully deploy new code and features to production.”
  2. Question: “What actually happened?”
    Answer: “The website went down during a regularly scheduled deployment.”
  3. Question: “Why did that happen?”
    Answer: “The staging and production environments were different. A bug that didn’t manifest in the staging environment manifested in production. That caused the application to crash.”
  4. Question: “How can it be avoided in the future?”
    Answer: “We should include additional checks in the code to improve our ability to catch error conditions and prevent the application from crashing. We should make sure that the staging and production environments are identical. If that’s not possible, we should implement additional testing using a canary deployment (or other means) to catch bugs before they are fully deployed to production.”

The last step should also include a list of actionable items, with an owner assigned to each one. A routine follow-up should also be conducted to ensure that those action items were actually completed in a timely manner.
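As a rough illustration of that follow-up step, the sketch below tracks action items with an owner and a due date, and reports the ones that are still open past their deadline. The `ActionItem` structure, the `overdue_items` helper, the team names, and the dates are all hypothetical, chosen only to mirror the deployment example above. In practice this information would normally live in your ticketing system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up task from a postmortem (hypothetical structure)."""
    description: str
    owner: str
    due: date
    completed_on: Optional[date] = None


def overdue_items(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return the items that are still open after their due date."""
    return [i for i in items if i.completed_on is None and i.due < today]


# Illustrative data only; owners and dates are made up.
items = [
    ActionItem("Add error handling around the failing code path",
               "dev-team", date(2020, 12, 9)),
    ActionItem("Align staging configuration with production",
               "ops-team", date(2020, 12, 16)),
    ActionItem("Add a canary stage to the deployment pipeline",
               "ops-team", date(2020, 12, 11),
               completed_on=date(2020, 12, 10)),
]

# A routine follow-up run on 14 Dec flags only the first item.
for item in overdue_items(items, today=date(2020, 12, 14)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```

Running a check like this on a schedule is one simple way to make the routine follow-up happen without relying on anyone's memory.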

Notice that at no point in our blameless postmortem scenario did anyone attempt to blame another group. Instead, they conducted an objective analysis of the incident. This process would also include a proper root cause analysis, along with a list of possible remedial actions. You can also get ahead of the blame game by proactively avoiding some common communication mistakes among teams.

Potential Postmortem Pitfalls

The problem with trying to instill an accountable yet blameless culture in organizations is that, as we mentioned earlier, humans tend to be hard-wired to point the finger — whether it’s at themselves or someone else.

For an example of how you can avoid “the blame game,” check out “Blameless postmortems don’t work. Here’s what does.” In short, you want to make sure that your process is solid, you hold people to the process, you always keep in mind that you are dealing with human beings, you are “blame aware,” and you work with your teams to help them understand healthier ways to interact and improve.

Postmortem Best Practices

The following are a few best practices and tips to help you on your journey to a better postmortem process:

  1. Obtain buy-in from management, from the bottom all the way to the top. Without some kind of authority behind your process, it will most likely go nowhere.
  2. Assign a process owner. This individual will be responsible for all follow-up, including scheduling meetings.
  3. Keep the overall process simple. Complicated processes make gaining acceptance more difficult. A lack of acceptance begets non-compliance.
  4. Create a project in your ticketing system dedicated solely to tracking incident workflow.
  5. Keep the ticket workflow simple.
    • For example, a simple workflow might be something like:
      1. Incident in progress
      2. Incident resolved
      3. Root cause analysis
      4. Incident follow-up
      5. Incident closed
  6. Keep the amount of information required for a ticket to a minimum. If you have fewer fields in the ticket, it will be easier for people to identify the information that will facilitate the process. It will also increase the likelihood that the ticket will be filled out properly.
    • A minimalist ticket might look like the following:
      1. Title
      2. Executive Summary
      3. List of personnel who participated in resolving the incident
      4. Ticket owner (incident owner)
      5. Incident date
      6. Start and end time of the incident. (We recommend using UTC if your organization spans more than one time zone. This will also help keep the timeline more accurate when reviewing server or chat logs, since correlation is easier when it doesn’t require conversion.)
      7. Incident timeline
      8. What happened?
      9. Why did it happen (i.e., RCA)?
      10. Attachments, links, graphs, logs, or other information
      11. Sub-tickets with suggested follow-up actions
      12. Due date for follow-up
  7. Enforce ticket creation whenever a major incident occurs. This can be done by the individual or team responding to the incident, or by an Incident Coordinator.
  8. Once the incident is over, assign the ticket to an owner. The owner will be responsible for following up on the root cause analysis and ensuring that action items that were created during postmortem discussions are completed.
  9. Appoint a process owner to ensure that tickets in the incident project move through the workflow. In addition, the process owner should be responsible for scheduling meetings as needed.
  10. You should initiate a postmortem when you have:
    1. Major outages that impact end users
    2. Failed deployments
    3. Security breaches
    4. Data loss
    5. Missed deadlines
    6. Repeated or unresolved incidents
  11. You should avoid a postmortem when you have:
    1. Minor problems
    2. Proactive maintenance to prevent larger problems
    3. Scheduled work (unless the work itself causes an incident)
  12. Finally, stamp out finger-pointing wherever possible and try to create a culture of “blame-awareness” and cooperation.
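To make the simple workflow and the minimalist ticket from the list above more concrete, here is a minimal sketch in Python. It is not tied to any particular ticketing system. The `Status`, `IncidentTicket`, and `to_utc` names are assumptions made for illustration, and only a subset of the suggested ticket fields is modeled. The sketch normalizes the start and end times to UTC, as recommended above, and only lets a ticket move forward one workflow step at a time.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Workflow steps, in the order listed in practice 5."""
    IN_PROGRESS = "Incident in progress"
    RESOLVED = "Incident resolved"
    ROOT_CAUSE_ANALYSIS = "Root cause analysis"
    FOLLOW_UP = "Incident follow-up"
    CLOSED = "Incident closed"


def to_utc(moment: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC; reject naive ones."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("timestamp needs an explicit timezone")
    return moment.astimezone(timezone.utc)


@dataclass
class IncidentTicket:
    """A subset of the minimalist ticket fields (hypothetical names)."""
    title: str
    executive_summary: str
    owner: str
    started_at: datetime
    ended_at: Optional[datetime] = None
    responders: List[str] = field(default_factory=list)
    status: Status = Status.IN_PROGRESS

    def __post_init__(self) -> None:
        # Store every timestamp in UTC so timelines line up with logs.
        self.started_at = to_utc(self.started_at)
        if self.ended_at is not None:
            self.ended_at = to_utc(self.ended_at)

    def advance(self) -> Status:
        """Move the ticket to the next workflow step."""
        steps = list(Status)  # Enum members iterate in definition order
        position = steps.index(self.status)
        if position == len(steps) - 1:
            raise ValueError("ticket is already closed")
        self.status = steps[position + 1]
        return self.status

    def duration(self) -> Optional[timedelta]:
        """Length of the incident, or None while it is still open."""
        if self.ended_at is None:
            return None
        return self.ended_at - self.started_at


# Illustrative values only: the start time is reported in a UTC-5 zone.
ticket = IncidentTicket(
    title="Website down during scheduled deployment",
    executive_summary="Deployment exposed a bug that did not appear in staging.",
    owner="ops-on-call",
    started_at=datetime(2020, 12, 1, 22, 30,
                        tzinfo=timezone(timedelta(hours=-5))),
    ended_at=datetime(2020, 12, 2, 4, 15, tzinfo=timezone.utc),
    responders=["dev-on-call", "qa-lead"],
)
print(ticket.started_at.isoformat())  # 2020-12-02T03:30:00+00:00
print(ticket.duration())              # 0:45:00
print(ticket.advance().value)         # Incident resolved
```

Rejecting timestamps without a timezone is a deliberate choice in this sketch: it forces whoever files the ticket to be explicit, which avoids silent off-by-hours errors when the timeline is later compared against server or chat logs.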

This article will point you in the right direction when it comes to postmortems, but there are many variables that organizations will need to assess in order to determine what will work best for them. Keep in mind that the postmortem process itself should be reassessed over time in order to account for changes in requirements and to make sure that it is still optimal for your organization.

There are many excellent articles that describe how different companies have implemented their version of the “blameless postmortem.” In particular, see “Blameless PostMortems and a Just Culture,” as well as “How to run a blameless postmortem” and “Tuning Blameless Postmortems.”

