Engineering the Reliability of Chaotic Cloud Native Environments
Cloud native applications offer advantages in scalability and velocity. Yet despite their resiliency, these systems have grown more complex as the number of application components continues to increase. Understanding how these components fit together has stretched beyond what can be easily digested, making it harder for organizations to prepare for the technical issues that this complexity can cause.
Last month, ChaosNative hosted its second annual engineering event, Chaos Carnival, where we discussed the principles of chaos engineering and how to use them to optimize cloud applications in today’s complex IT systems.
The panelists for this discussion:
- Karthik Satchitanand, co-founder and open source lead, ChaosNative
- Ramya Ramalinga Moorthy, industrialization head — Reliability & Resilience Engineering, LTI — Larsen & Toubro Infotech
- Charlotte Mach, engineering manager, Container Solutions
- Nora Jones, founder and CEO, Jeli
In large-scale distributed software systems, trying to move quickly without breaking things is chaotic. “There are just innumerable dependencies in an average software stack. Today, there are a lot of versions, and they are changing too fast. The deployment models have changed from what they used to be from a few years back,” Satchitanand said. Creating experiments by validating hypotheses, breaking things in a controlled way, and reaffirming assumptions about systems is all about “picking the right experiment but it is difficult in the cloud native world,” Satchitanand added.
But when you’re breaking things in production, “it’s a really hard place to know where to start. Many companies often want to feel more confident, without necessarily taking the time yet to look at their previous incidents,” said Jones.
Adopting chaos engineering approaches that are quick, while still ensuring the best experiments, requires “trying to figure out things that could break it, instead of going the way of something’s broken,” said Mach. “A very simple approach to start with is to come up with a lot of chaos use cases,” added Moorthy.
The journey toward chaos engineering is about embracing complexity. According to Satchitanand, “Chaos engineering is about learning new things. We expect things to behave a certain way strongly, because you’ve already tested it before, and you’ve seen it. In practice, there are things that don’t have to behave a certain way, but you’ve not had the chance to see it working that way. Then, there are things which you may have not thought about at all, which must be checked by running experiments. These are all great ways to approach chaos engineering and create use cases.”
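As a rough illustration of the loop the panelists describe (state a hypothesis, break something in a controlled way, then check whether the assumption still holds), here is a minimal Python sketch. It is not taken from the panel or from any chaos tooling: `ToyService`, `run_experiment`, `kill_one_replica`, and `restore` are hypothetical names, and the "service" is an in-memory stand-in rather than a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyService:
    """Hypothetical in-memory stand-in for a replicated service."""
    replicas: int = 3
    healthy: int = 3

    def is_serving(self) -> bool:
        # Assumption: the service answers while at least one replica is healthy.
        return self.healthy >= 1


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one controlled experiment and report what was learned."""
    # Only inject a fault if the system starts out in its expected state.
    if not steady_state():
        return "aborted: system not in steady state before injection"
    inject_fault()
    try:
        held = steady_state()  # re-check the hypothesis under fault
    finally:
        rollback()  # always restore, so the blast radius stays controlled
    return "hypothesis held" if held else "hypothesis refuted"


service = ToyService()


def kill_one_replica() -> None:
    service.healthy -= 1


def restore() -> None:
    service.healthy = service.replicas


# Hypothesis: losing one of three replicas does not stop the service.
result = run_experiment(service.is_serving, kill_one_replica, restore)
print(result)
```

Each of the three kinds of use case Satchitanand mentions would plug into the same shape. Only the hypothesis and the injected fault change.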
A key challenge of chaos engineering is mapping KPIs and chaos experiments to an ROI. The value is often found in “non-tangible things like trust in the system and hopefully less downtime. But this is the kind of thing that you can’t put a price tag on properly. So that’s why you must find other ways to convince people of the value of it,” said Jones.
But for all the chaos that goes into improving the resiliency of cloud native applications, it leads users to “create new metrics and define new SLOs that you might not have had a chance to think about. Maybe the scenario forces you to look for information, so you end up creating some metrics. That’s a side benefit of chaos engineering,” said Satchitanand.
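To make the idea of a metric and SLO that fall out of an experiment concrete, the sketch below computes a simple availability ratio for the window in which a fault was injected and compares it with a target. The function names, the request counts, and the 99.9% target are all illustrative assumptions, not figures from the panel.

```python
def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded during the observed window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def meets_slo(successes: int, total: int, target: float = 0.999) -> bool:
    """True if observed availability is at or above the (assumed) SLO target."""
    return availability(successes, total) >= target


# Hypothetical numbers: 9,950 of 10,000 requests succeeded while the fault was active.
observed = availability(9_950, 10_000)
within_target = meets_slo(9_950, 10_000)
print(f"availability={observed:.3f}, meets SLO: {within_target}")
```

In this made-up run the service falls short of the assumed target, which is the kind of gap an experiment can surface before a real incident does.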