There Is No Shame in Customer-Reported Incidents
I was at a community event over the summer, talking to other incident management practitioners, when I heard one of them mention that he was mortified that a recent incident had been uncovered only after a customer reported it.
I really felt for the guy because I’ve been there before too. And honestly, I bet most people who’ve been involved in managing incidents have also had this experience.
It’s not an uncommon scenario, but it shouldn’t be a painful one. It’s time to remove the shame around the idea of the customer-reported incident by talking about it. I’ll go first.
Customers Are Another Form of Alerts
This was at a company I worked at back in my on-call responder days. We updated our deployment pipelines to use Spinnaker. We assumed that since a lot of big companies used it, it was production ready and perfect for our situation too.
We set it up so all deployments and the data about them were stored in Redis, and that’s how Spinnaker knew the current state of each deploy. And since that wasn’t complex enough for us, we then used Jenkins to run our tests and build deployable artifacts; when the build turned green, it would launch a deployment pipeline in Spinnaker.
Then one day, Redis died on us, which meant Spinnaker lost all context for what it should be doing and what — and this is a crucial point — it had done in the past. It didn’t even produce bugs, so there was nothing that our monitoring or observability tools could have alerted us to. In fact, we only found out there was a problem because our customer support team notified us that they were getting reports of the fonts looking different on our website. Then came alerts in the form of, “Hey, where’d this page go?”
Because Spinnaker had no idea that it had executed a deployment pipeline for a successful build three months ago, thousands of deployments had been kicked off. The entire website was reverted to one that was three months old. Chaos ensued, and we just turned Spinnaker off. We literally just cut the power to it, then manually deployed the website’s current version.
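A toy sketch in Python can make this failure mode concrete. It is not Spinnaker’s actual trigger logic, and the function name, variable names and build numbers are hypothetical. It only models the general idea: a deployer that decides what is “new” by comparing green builds against a record of what it has already handled.

```python
def builds_to_deploy(green_builds, already_deployed):
    """Return the green builds the deployer believes it has not handled yet."""
    return [build for build in green_builds if build not in already_deployed]


# Hypothetical history: builds 1 through 5 all turned green over time.
green_builds = [1, 2, 3, 4, 5]

# Normal operation: the state store remembers builds 1-4, so only build 5 is new.
state_store = {1, 2, 3, 4}
print(builds_to_deploy(green_builds, state_store))  # [5]

# After the state store is wiped, every past build looks new again.
state_store = set()
print(builds_to_deploy(green_builds, state_store))  # [1, 2, 3, 4, 5]
```

Nothing in this sketch raises an error when the store is empty, which is why error-based monitoring has nothing to alert on in a situation like this.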
There was a lot that went wrong here. We weren’t production ready, and we architected an overly complex solution. But one thing I don’t count as going wrong with this incident is that we heard about it through customer support.
On the contrary, we were grateful. We never would’ve uncovered this problem — or it at least would’ve taken longer to do so — if it hadn’t been for our customers reporting it. There’s no shame there, only an opportunity to learn and make it right.
Build Trust with Your Customers
Especially for those of us who have a lot of high-tech companies as our customers, it’s understood that bugs happen, products go down or, in my example, websites inexplicably revert to old versions.
Unless you’re dealing with a catastrophic incident involving missing or leaked data, most customers will give you the grace of understanding that these things happen. What you do in these situations is much more important than the fact that the alert came from a customer.
When you’re known for handling your worst moments well, people pay attention. I remember last year when Fastly went down in a big way. The company’s stock actually went up the next day because people were impressed by not only how quickly they remediated the problem but also how quickly and clearly they communicated what was going on.
Slack is another company that does a great job of this. By treating minor bugs like incidents, and publicly declaring them even when they might affect only a small percentage of people, it has trained people to check its status page or Twitter feed when they experience an issue. And when your customers know what to expect from you, and know they can depend on you, they’re more likely to stick with you.
Expectation-setting like this doesn’t happen accidentally, and it doesn’t happen overnight. Instead, it requires making proactive communication a priority of your incident response process. A few best practices to keep in mind include:
Have a Centralized Source of Truth for Your Latest Updates
This is most likely on a status page. Whether it’s your public status page or customer-specific private status page, it should only host the most up-to-date and accurate information, and it should be easily accessible when a customer is trying to figure out what’s wrong.
Communicate Early, Often and Consistently
When customers are affected, it’s important to get your message out early. You’ll want to make sure that your message is clear, honest and easy to understand. Your message should be unified across your status page, emails, social media and messages from customer-facing teams. It’s essential to train anyone on your team who will be speaking directly with customers, and to make sure you know how to work with other departments such as customer success or marketing.
Be Accountable and Take Ownership
Most importantly, you should own the incident. The sentiment “honesty is the best policy” fully applies here, and customers will appreciate your transparency.
Learn from Your Incidents
The key here is to partner closely with your customer support team and treat interactions with them as a two-way street. Customer- or CS-reported incidents give you a front-row look at how your customers interact with your product, and that’s an unparalleled learning opportunity. You get a deeper understanding of not only how your customers use your product, but also what they expect from you during an incident. By implementing these learnings after the incident is resolved, you truly move forward in building trust with your customers.
Remove the Shame
Back to the story I started with about the mortified responder. His comment was met with several, “Yeah, that’s happened to us too” stories from the other responders in the room, which did a great thing: It removed the shame for him.
The more we talk about our incidents, both with our customers and with each other, the more we can move our industry toward a culture of psychological safety and of learning from incidents. There will always be incidents. And customers will always be one of the ways that we find out about them. We shouldn’t strive for perfection, but instead for open communication and reflection.