
5 Tips To Improve Your SRE Incident Metrics

1 Mar 2021 1:38pm, by

Tammy Butow
Tammy is a Principal SRE at Gremlin, where she works on Chaos Engineering. She previously led SRE teams at Dropbox. Tammy is also the co-founder of Girl Geek Academy, on a mission to teach 1 million women technical skills by 2025.

As a Site Reliability Engineer (SRE), you are responsible for maintaining uptime for your customers. But managing and improving your SRE incident metrics isn’t a simple task. You need to build a solid foundation that can set you up for future success, as well as do constant maintenance and make improvements along the way.

As a former SRE Manager at Dropbox and now as a Principal SRE at Gremlin, I have always strived for metrics that help me tell a story in every interaction I have with my stakeholders. Without metrics, it’s very difficult to understand your own systems — let alone explain them to others. Metrics enable you to identify anomalies quickly, to prioritize where to focus your efforts and ultimately to make impactful decisions and improvements for your customers.

“It makes much more sense to use data to tell your story, rather than having data be the story.”

Figuring out how to get started is often the most difficult aspect of establishing and improving your SRE incident metrics. Here are 5 tips for improving your understanding of SRE incident metrics:

1. Avoid Pointless Work 

Do work that is going to be meaningful for your customers. This in turn will help you feel great about what you do, because you’ll understand the purpose of your work. If something isn’t worth measuring, is it worth doing? I find this to be a great test to help decide whether you should (or should not) prioritize specific pieces of work or projects.

2. Understand the Incident Data and the Story You Want to Tell

Get a quick feel for the data that currently exists. Can you get data for incidents, teams, users, alerts, services, servers, and containers or Kubernetes pods? Start in a scrappy fashion by dumping this data into a simple database or spreadsheet, so you can get a feel for it before you productionize it. In the past I’ve started with PagerDuty’s Advanced CSV Export template, a MySQL or Postgres database, Prometheus and Grafana, Tableau, and more recently Google BigQuery and Google Data Studio. Choose the tool you’re most comfortable with and give yourself a few weeks to learn the data.
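As a rough illustration of that scrappy first pass, here is a minimal Python sketch that loads an incident export into an in-memory SQLite database and asks one simple question of it. The column names (`incident_id`, `service`, `team`, `created_at`, `resolved_at`) and the sample rows are assumptions made up for this example; they are not the actual schema of PagerDuty's CSV export, so adjust them to whatever your own export contains.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: these columns and rows are illustrative only,
# not the real layout of any vendor's CSV export.
SAMPLE_CSV = """incident_id,service,team,created_at,resolved_at
1,checkout,payments,2021-02-01T10:00:00,2021-02-01T10:45:00
2,checkout,payments,2021-02-03T22:10:00,2021-02-03T22:30:00
3,search,discovery,2021-02-04T08:00:00,2021-02-04T09:30:00
"""


def load_incidents(csv_text, conn):
    """Create the incidents table if needed and insert every CSV row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incidents ("
        "incident_id TEXT PRIMARY KEY, service TEXT, team TEXT, "
        "created_at TEXT, resolved_at TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executemany(
        "INSERT INTO incidents VALUES "
        "(:incident_id, :service, :team, :created_at, :resolved_at)",
        rows,
    )
    conn.commit()
    return len(rows)


def incidents_per_service(conn):
    """Return (service, incident count, average minutes to resolve),
    with the noisiest service first."""
    cur = conn.execute(
        "SELECT service, COUNT(*) AS n, "
        "AVG((julianday(resolved_at) - julianday(created_at)) * 24 * 60) "
        "FROM incidents GROUP BY service ORDER BY n DESC, service"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_incidents(SAMPLE_CSV, conn)
    for service, count, avg_minutes in incidents_per_service(conn):
        print(f"{service}: {count} incidents, ~{avg_minutes:.0f} min to resolve")
```

Something this small is usually enough to tell you which fields are reliable and which questions the data can actually answer before you invest in a proper pipeline.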

3. Understand Your Target Audience and Their Goals/Roadmap

Tell the right story at the right time, to the right people. Focus on building your first dashboard or report in such a way that you could meet with senior leadership (business and technical teams) and explain what’s happening in the world of incident management. They are your target audience for conversations around head count, budget, technology platforms, migrations, etc. Luck is what happens when preparation meets opportunity. Be ready to share your screen and speak about your data.

4. Design and Build Your Solution to Tell Your Story to Your Audience

Determine your current biggest problem to solve by looking at this data before you share it more widely with other engineers and teams. Do you need to reduce the number of incidents? Do you need to load balance engineers across teams based on the number of incidents? For example, suppose one team has no incidents but four engineers with excellent SRE-type skills, while another team has hundreds of incidents a week and nobody with SRE-type skills. Then it’s time to load balance expertise across teams!
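The load-balancing question above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the `TeamLoad` fields, the team names and the figure of 300 weekly incidents are hypothetical stand-ins for "hundreds of incidents a week", and "incidents per SRE-skilled engineer" is just one possible way to compare teams.

```python
from dataclasses import dataclass


@dataclass
class TeamLoad:
    # All fields are hypothetical; map them to whatever your data provides.
    name: str
    weekly_incidents: int
    sre_skilled_engineers: int


def incidents_per_skilled_engineer(team):
    """Weekly incidents carried by each SRE-skilled engineer on the team.
    A team with incidents but no skilled engineers counts as infinitely loaded."""
    if team.sre_skilled_engineers == 0:
        return float("inf") if team.weekly_incidents > 0 else 0.0
    return team.weekly_incidents / team.sre_skilled_engineers


def rebalancing_candidates(teams):
    """Return (donor, recipient) team names, or None if no move makes sense.
    The donor is the least-loaded team that has skilled engineers to share;
    the recipient is the most-loaded team."""
    donors = [t for t in teams if t.sre_skilled_engineers > 0]
    if not donors or not teams:
        return None
    donor = min(donors, key=incidents_per_skilled_engineer)
    recipient = max(teams, key=incidents_per_skilled_engineer)
    if donor.name == recipient.name:
        return None
    return donor.name, recipient.name


if __name__ == "__main__":
    teams = [
        TeamLoad("platform", weekly_incidents=0, sre_skilled_engineers=4),
        TeamLoad("checkout", weekly_incidents=300, sre_skilled_engineers=0),
    ]
    print(rebalancing_candidates(teams))
```

A number like this is a conversation starter, not a staffing decision; it simply makes the imbalance visible enough to discuss.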

Determine how you’ll scale out this data and make it more accessible to other teams. In the past I have used the PagerDuty API to build an internal dashboard where anyone across our engineering team could visualize their PagerDuty metrics. We pulled the data automatically through the API and visualized it on a custom-built dashboard using Highcharts. Nowadays there are even easier ways to do this; for example, Grafana has integrated with PagerDuty via custom plugins that enable you to easily tell your story (an example is the Pareto plugin).
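To give a sense of what a Pareto view of incident data computes, here is a small Python sketch. It is not the implementation of any Grafana plugin or of the dashboard described above; the function name `pareto_table` and the sample service names are assumptions for illustration. It ranks services by incident count and tracks the cumulative share of all incidents, which is the number a Pareto chart plots as its line.

```python
from collections import Counter


def pareto_table(incident_services):
    """Given one service name per incident, return rows of
    (service, incident count, cumulative % of all incidents),
    ordered from the noisiest service to the quietest."""
    counts = Counter(incident_services)
    total = sum(counts.values())
    table = []
    running = 0
    # Sort by count descending, then by name so ties are deterministic.
    for service, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        running += n
        table.append((service, n, round(100 * running / total, 1)))
    return table


if __name__ == "__main__":
    # Hypothetical incident log: one entry per incident.
    incidents = ["checkout"] * 6 + ["search"] * 3 + ["login"]
    for service, count, cumulative in pareto_table(incidents):
        print(f"{service:10} {count:3}  {cumulative:5.1f}%")
```

If a handful of services account for most of the cumulative percentage, that is usually the story to lead with.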

5. Craft a Story to Tell Your Coworkers and Leadership Team 

Crafting a story is the most important aspect of this journey. A common mistake is just giving the data to others without doing any analysis or determining the story you want to tell. It makes much more sense to use data to tell your story, rather than having data be the story. If you are not practiced in storytelling, I recommend reading Cole Nussbaumer Knaflic’s book, “Storytelling with Data: A Data Visualization Guide for Business Professionals.” It’s an excellent read for any aspiring storyteller.

As Knaflic explains:

“Storytelling is not an inherent skill, especially when it comes to data visualization, and the tools at our disposal don’t make it any easier… go beyond conventional tools to reach the root of your data, and how to use your data to create an engaging, informative, compelling story”.

The easiest path to success is to use the tools your company is already using. Do what makes sense culturally for your target audience. This way, there will be less new information for everyone to take in and understand. They’ll be able to focus on the important story you are sharing with them.

Lead image via Pixabay.
