There is established tooling for alert routing: something breaks, an alert gets routed to an on-call dev or ops person, and they fix it. Then you try to automate testing to make sure it doesn’t happen again.
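That everyday routing loop can be sketched as a simple lookup. This is a minimal illustration, not any particular vendor's implementation; the service names and addresses are hypothetical, and real tools like PagerDuty or Opsgenie layer on-call schedules and escalation policies on top:

```python
# Minimal sketch of everyday alert routing. Services, rotations and
# addresses here are hypothetical placeholders.
ON_CALL = {
    "payments": ["dev-oncall@example.com"],
    "checkout": ["ops-oncall@example.com"],
}

DEFAULT_RECIPIENTS = ["ops-oncall@example.com"]

def route_alert(service, message):
    """Return who gets paged when a service breaks (falls back to ops)."""
    recipients = ON_CALL.get(service, DEFAULT_RECIPIENTS)
    # A real router would page these people; here we just return them.
    return recipients
```

For routine failures this is enough: one service, one owner, one page. The rest of the article is about the incidents that overflow this model.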
Ninety-nine percent of the time, that’s sufficient. But what happens when everything breaks? How does an organization deal with the other one percent of major incidents? You know, those unpredictable, black swan events. The major, headline-stealing outages. The ones that wake an entire panicked team — not just dev and ops, but customer support, legal, PR, and sometimes even HR.
Kintaba is, as far as we know, the first truly modern incident management system built to orchestrate the response to and recovery from these major outages, with a holistic approach that spans an entire organization.
The inspiration behind Kintaba is the Japanese art and philosophy of Kintsugi, where you reconnect broken pieces of pottery with gold, silver or platinum filling. Instead of searching for super glue to hide the cracks, this practice highlights them as part of the history and even beauty of the object.
CEO and co-founder John Egan told The New Stack that Kintaba similarly looks at major incident management as something that can make companies stronger and more resilient. Perhaps even more valuable, so long as they don’t let history repeat itself.
Automating ‘What Are We Going to Do When Things Go Wrong?’
Proper incident management comes down to asking the right questions and routing them to the right people to answer them. However, when you are dealing with unforeseeable circumstances, you rarely have your wits about you to make those decisions quickly.
Egan said, typically, when a major incident occurs, “We panic. We send emails. And talk on Slack. And we never learn or write up processes and it happens again in a week.”
Since these major incidents are inherently unique, it becomes about trying to understand an ambiguous situation and then narrowing it down toward resolution.
Egan said, “In the major-outage world, you are really responding to a symptom. All you know is it’s down for everyone and you don’t know why.”
Kintaba automates the full incident management lifecycle, combining best-practices automation with a human touch and a focus on preventing repeat incidents in the future.
“We used to approach the world as if we could make it so nothing breaks. Now we can say the most reliable, the most resilient companies in the market are the ones that expect things to go wrong.”
— John Egan, Kintaba
Kintaba is built with the unicorn in mind. The FAANGs (Facebook, Amazon, Apple, Netflix and Google) have worked for years to learn from these unexpected events, and have developed and shared best practices for responding to the unpredictable. It’s billion-dollar companies like Kintaba customers Gusto, Very Good Security and Vercel that are rapidly growing and simply can’t let a major outage beat them.
Egan describes the Kintaba method of response as incredibly collaborative, not just technically but in orchestrating a response across an organization. Because when a website or service is completely down, you need to call in not just tech, but customer support, marketing and legal, too.
Kintaba integrates with many major collaboration and issue tracking tools, but not just in one direction. The tool broadcasts updates into a set of specified Slack channels or Zendesk, while at the same time logging everything for the postmortem and sending follow-up tasks to Jira.
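That two-way fan-out, one update flowing to chat, to the postmortem record, and to the issue tracker at once, can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Kintaba's actual integration code; the sinks here are plain in-memory lists standing in for Slack, the postmortem log and Jira:

```python
# Hypothetical sketch of two-way integration fan-out. In a real system
# these lists would be API clients for Slack, a postmortem store and Jira.
slack_messages = []   # broadcast channel updates
postmortem_log = []   # timeline kept for the review
jira_tasks = []       # remediation work filed for follow-up

def publish_update(update, followup=None):
    """Send one status update everywhere it needs to go, at the same time."""
    slack_messages.append(update)    # keep responders informed
    postmortem_log.append(update)    # preserve the record for later
    if followup:
        jira_tasks.append(followup)  # turn the update into tracked work
```

The point of the pattern is that responders write an update once, and the tooling guarantees the postmortem and follow-up work never fall out of sync with what was broadcast.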
Kintaba is necessarily integrated across the stack at different places.
Egan said, “The tool itself is the process. The best way to implement a process is to implement a tool, so Kintaba comes with best practices out of the box.”
Instead of bringing in an incident response consultant, he says Kintaba stands on the following pillars:
- Openness of data — across an organization so everyone can see what’s going on
- Definition of roles — for example, if personally identifiable information is compromised, pre-defined automation ensures legal, compliance and engineering are pinged right away
- Real-time assembly of your response team — tracking who is responding, what they are doing and talking about, all with timestamps
- Postmortems and reviews — what was learned, which processes need to change, and making sure the findings are distributed out to the company
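The "definition of roles" pillar above lends itself to a rule table: incident attributes map to the teams that must be assembled automatically. A minimal sketch, with hypothetical tags and team names (the PII example mirrors the one in the list):

```python
# Hypothetical role-routing rules: which teams are pulled in automatically
# based on an incident's attributes. Tags and team names are illustrative.
ROLE_RULES = {
    "pii_exposed": {"legal", "compliance", "engineering"},
    "full_outage": {"engineering", "customer_support", "pr"},
}

def teams_to_page(tags):
    """Union of every team triggered by the incident's tags."""
    teams = set()
    for tag in tags:
        teams |= ROLE_RULES.get(tag, set())
    return teams
```

Pre-defining these rules is what lets the right people get pinged in the first minute, before anyone has had to stop and think about who to call.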
“Kintaba lives company wide,” Egan said.
Don’t Worry, Your Systems Will Break Again
In no way is the Kintaba team saying you won’t still have major incidents. Because you will. You just shouldn’t have repetitive, predictable ones, because you’ve automated and updated processes to avoid that particular catastrophe next time.
Remediations can range from changing a Terraform configuration or a namespace to teaching new employees at onboarding not to hit a tripwire that breaks the system. Kintaba builds customized knowledge libraries to make organizations more resilient.
Egan referred to last year’s Cloudflare outage that took down half the internet for 27 minutes. He said this major outage occurred in a completely unexpected way, which the company shared in detail in its public postmortem.

“If it can happen at a company like Cloudflare, it can happen anywhere,” he said.
Egan says Kintaba’s organization-wide approach is more necessary than ever because our increasingly distributed systems are exponentially more complicated.
Why can’t we just use chaos engineering and ample automated and penetration testing to shore things up? Egan says we can and we should — for that 99%. But infrastructure today is far more abstracted than it was ten years ago. He sees the butterfly effect in place more than ever: a single keystroke in an underlying system can cause Kubernetes to fall over, creating a domino effect that knocks out not just one server but every service.
Performing root cause analysis is increasingly challenging as we are abstracted out from those roots. Egan says the predictable errors are pretty much automated out — if one Kubernetes server goes down, the system will bump up another. But if Kubernetes itself goes down, you just can’t know immediately how or why.
Egan cited the Law of Stretched Systems: our continuous jump to the next new technology has us forgetting to think about the consequences. When things inevitably go wrong, these increasingly distributed and abstracted solutions mean they go more wrong than ever before.
On the other hand, things breaking isn’t new at all.
A Major Incident Isn’t Just Tech, It’s People Learning How to Make Mistakes
Egan said that the tech industry, just like every other one, is used to punishing individuals for mistakes. But firing someone doesn’t usually go anywhere near the root cause of any technical incident.
He said that, by the 1950s, big industry had started to realize that firing people who made mistakes didn’t actually improve efficiency on the factory floor. Instead, they encouraged people to write down what they learned.
“All of the successful companies are proving that it’s rarely a person’s fault when an outage happens. It’s often a systemic issue. You need systems to go in and get to where those problems are instead of making it an HR problem.”
— John Egan, Kintaba
Similarly, technical mistakes that lead to these catastrophic incidents are rarely human error. They are much more likely to be contextual and systemic. The project management part of incident management therefore has to be about identifying and solving those systemic problems.
Egan said that software culture is the one that really has to evolve, moving from looking for individuals to blame toward making sure systemic mistakes aren’t repeated. Despite the so-called “fail fast” Silicon Valley mindset, he says we are only beginning to get culturally comfortable with this idea.
Kintaba is designed on the belief that major incident response is not just about paging someone — it’s about bringing a whole team together, mitigating harm while also keeping all stakeholders up-to-date. And then the final phase is the ability to document and learn — deciding how things are going to operate in the future to prevent this issue from happening again.
New Feature ‘Milestones’ Creates Essential Incident Timeline
This week, the company introduced a new feature to its platform, called Milestones. While Kintaba logs the full details in the postmortem, Milestones looks to highlight the key moments. It acts as a sort of at-a-glance remediation timeline that can be broadly understood across the company.
You can reconstruct everything through a postmortem, but a milestone flags, as he put it, “a really important moment among this flurry of moments.”
The new feature is flexible, so companies can be as specific as they want (say, Server 7 went down at 7 p.m.) and can create a milestone automatically via the Kintaba API from an integrated external system. On the other hand, a milestone can be very human, like the moment you got that first customer complaint call.
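Creating a milestone from an external monitor might look something like the sketch below. To be clear, the endpoint URL, field names and auth header here are all assumptions for illustration, not Kintaba's documented API; consult the actual Kintaba API documentation for the real shapes:

```python
import json
import time
import urllib.request

# Hypothetical endpoint shape -- not Kintaba's documented API.
API_URL = "https://api.example.com/incidents/{incident_id}/milestones"

def build_milestone(incident_id, text):
    """Assemble a milestone payload, e.g. 'Server 7 went down at 7 p.m.'"""
    return {
        "incident_id": incident_id,
        "text": text,                     # free-form: text, links, etc.
        "timestamp": int(time.time()),    # when the moment happened
    }

def post_milestone(incident_id, text, token):
    """POST the milestone from an integrated external system."""
    payload = json.dumps(build_milestone(incident_id, text)).encode()
    req = urllib.request.Request(
        API_URL.format(incident_id=incident_id),
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req)  # fires the POST
```

The point is that a monitoring system can stamp the timeline automatically at the moment something happens, while humans add the softer milestones by hand.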
Egan says milestones are unique in this human, flexible quality: they can be defined in real time during the crisis, in free form, whether as text, images, charts or screenshots.
“You are setting up a process to respond to the unknown. It’s important you don’t pre-define — you just need a system,” he said.
Milestones are just another way Kintaba aims to handle the incident management process for you, so you can focus on the incident itself and move on from it.