The Resilience Roundtable: A Discussion About Chaos Engineering and More
Gremlin sponsored this post.
2020 was an interesting year… to say the least. The pandemic changed our lives in ways that will outlast the virus itself. Add to the mix a rise in civil unrest, and I think it’s fair to say that we need a reboot in 2021.
In the world of DevOps, there’s quite a bit to be optimistic about. Funding of technology startups has actually increased over the past year. Digital transformations have been accelerated, as companies across all industries prioritize the online experience in a distributed world. Modern tooling like Slack and Zoom has made it possible for many of us to continue working, to stay in touch with loved ones, and even to be entertained as we are stuck at home.
Reliable technology has played a critical role in helping maintain a sense of normalcy and connectedness.
And so I wanted to get a panel together that consisted of some of the premier thought leaders in the space. These founders and executives are on the frontlines building solutions that help companies modernize, solve problems, and become more resilient.
- Kolton Andrus: The CEO and co-founder of Gremlin, the world’s first fully-hosted chaos engineering platform. Previously worked on building robust systems at Amazon and Netflix.
- Charity Majors: The CTO and co-founder of Honeycomb, an observability platform to understand production systems. Previously worked at Facebook as a production engineering manager, focusing on their backend-as-a-service platform Parse.
- John Egan: The CEO and co-founder of Kintaba, a modern incident management platform for your entire organization. Previously built a startup that was acquired by Facebook, where he then led product for their enterprise offering Workplace.
- Daniel “Spoons” Spoonhower: The CTO and co-founder of Lightstep, a cutting-edge observability and distributed tracing platform. Previously worked at Google and is also the co-founder of the OpenTelemetry project.
- Shahar Fogel: The CEO of Rookout, a live debugging platform enabling developers to debug modern applications faster than ever. Previously was the CEO of Brandtix and the VP of Product at Connectik Technologies.
Key Takeaways from the Resilience Roundtable
Major Outages Impact Companies Both Big and Small
Yes, it’s true that Amazon can lose millions of dollars if it is down for even a few minutes and that Robinhood might lose countless users each time it crashes during a major market movement. But for startups, even if they aren’t losing millions of dollars or hundreds of customers, the relative impact on their business can actually be much greater. For a startup, losing even a single big customer can mean losing a significant chunk of revenue. So while big companies make for big headlines, startups can feel the pain of major outages just as much, if not more.
Postmortems Should Be Shared Broadly and Publicly
Creating a culture that accepts failure and learns from it is a major and important shift for many companies. Too often when something goes wrong within traditional organizations, people who weren’t even there (e.g. management) dole out punishment and blame as the primary response. In modern incident management, blameless postmortems are a way to formally document what went wrong and why, in an effort to better understand the incident and prevent it from happening again. These documents should not only be shared with your team; they should also be shared publicly so that anyone interested can learn from what happened. (Cross-company resilience FTW)
You Build It, You Own It!
The best way to get software developers to care about the reliability of their applications… is to put them on call! Skin in the game can make a world of difference. If the engineer knows it’s their pager that will fire in the middle of the night or over the holiday break, they are much more likely to write code that stands up.
Resilience Is Shifting Left
This is a core promise of DevOps: that the daylight between who writes the code and who is responsible for that code’s behavior in production becomes narrower and narrower. When we think of shifting more of the operational burden upfront (i.e. Proactive Ops), we may also think of the cutting-edge discipline of Chaos Engineering. Like a vaccine, it injects a little failure upfront, on your own terms, in order to build longer-term resilience. And for software developers, resilience often means more than just checking if systems are up or down; it means being able to debug customer-facing issues on the fly and provide a seamless online experience even when the unexpected happens.
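To make the vaccine analogy concrete, here is a minimal Python sketch of the idea, not how Gremlin or any panelist's product works. Every name in it (`with_chaos`, `fetch_price`, `fetch_price_resilient`, `run_experiment`) is hypothetical: a wrapper makes a controlled fraction of calls to a stand-in dependency fail on purpose, and the experiment checks that the caller degrades to a cached value instead of crashing.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(symbol):
    # Hypothetical stand-in for a call to a downstream service.
    return {"symbol": symbol, "price": 100.0, "stale": False}


def fetch_price_resilient(fetch, symbol, cache):
    """Serve the last known value, marked stale, when the dependency fails."""
    try:
        result = fetch(symbol)
        cache[symbol] = result
        return result
    except InjectedFault:
        if symbol in cache:
            return {**cache[symbol], "stale": True}
        return {"symbol": symbol, "price": None, "stale": True}


def run_experiment(calls=1000, failure_rate=0.2, seed=42):
    """Return how many calls were served a stale fallback answer."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_chaos(fetch_price, failure_rate, rng)
    cache = {}
    stale = 0
    for _ in range(calls):
        if fetch_price_resilient(flaky_fetch, "ACME", cache)["stale"]:
            stale += 1
    return stale
```

Calling `run_experiment()` and seeing it return without raising is the point of the exercise: the failure was chosen, scoped, and survived on your own terms. A real experiment would more likely inject latency or network faults at the infrastructure level rather than inside application code.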
Observability Is Real, AIOps Not So Much
Among the panelists, there was a near-unanimous reaction to the term “AIOps” (eye roll). While machines solving all of our problems makes for good headlines, the truth is that humans are still very much needed to attribute value to machine-detected anomalies. You’re also adding another project for your engineers to be concerned about: before, they just wanted to improve resilience, but now they have to build and maintain the AI to help with that resilience! Simply adopting the best DevOps/SRE practices will likely get you further, for now.
Lightstep is a sponsor of The New Stack.