
How Chaos Engineering Helps You Reduce Cloud Spend

28 Dec 2020 8:35am, by Andre Newman

Gremlin sponsored this post.

Andre is a technical writer for Gremlin, where he writes about the benefits and applications of Chaos Engineering. Prior to joining Gremlin, he worked as a consultant for startups and SaaS providers, where he wrote about DevOps, observability, SIEM, and microservices. He has been featured in DZone, StatusCode Weekly, and Next City.

Cloud platforms opened the floodgates for engineering teams to run enterprise-scale applications at much lower cost than traditional on-premises data centers. That said, cloud computing can still get expensive — especially as you scale up your operations. The Flexera 2020 State of the Cloud report found that cost savings was the number one priority for 73% of organizations, and that 23% had gone over budget on cloud spend.

Fortunately, cloud platforms provide a number of cost-optimization features — like resource sizing, on-demand infrastructure, and autoscaling. The trick is knowing how to use these features, while also providing high performance and high reliability in your applications.

In this article, we’ll look at a few different ways you can reduce your cloud spend and how to use Chaos Engineering to do so safely and intelligently.

Right-Size Your Infrastructure

There’s a balance to strike between provisioning enough capacity and not paying for unused capacity, but finding this balance is tough. For example, how do you:

  • Right-size a virtual machine instance so that it isn’t excessively idle, but can still handle changes in demand?
  • Scale down idle resources without inadvertently creating a bottleneck?
  • Know that you can reliably scale your applications?

We need a safe way to validate that our changes are right for our environment, and the way we do this is with Chaos Engineering. Chaos Engineering is the practice of deliberately testing systems for failure by injecting precise amounts of harm. By observing how our systems respond to this failure, we can make them more resilient.

How does this apply to right-sizing cloud infrastructure? Imagine we have a group of virtual machine instances that we want to scale once CPU usage reaches a certain threshold (e.g. 80% across all nodes for more than one minute). Traditionally, in order to test this autoscaling rule, we’d either need to wait for traffic to organically reach this threshold, or simulate the traffic ourselves using complex scripts. But with Chaos Engineering, we can easily consume CPU cycles across the cluster. We can then monitor our instances and applications to make sure that:

  • The new systems start up correctly.
  • We can load balance traffic between our systems.
  • The customer experience isn’t negatively affected.

Of course, we also want to make sure that we can scale back down when resources aren't in use, since we don't want to pay for capacity we're no longer using. Once your systems scale up, halt your experiment and continue monitoring your instance group to make sure that it automatically scales back down.
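To make the experiment concrete, here is a minimal sketch of the CPU-consumption step in Python. This is not Gremlin's implementation; it's a hypothetical script (the `consume_cpu` function and its parameters are illustrative) that spins busy loops on every core so that CPU usage crosses the autoscaling threshold described above.

```python
import multiprocessing
import time

def burn_cpu(stop_time):
    """Busy-loop until stop_time, keeping one core near 100% usage."""
    while time.time() < stop_time:
        pass  # spin

def consume_cpu(duration_seconds, workers=None):
    """Consume CPU on `workers` cores (default: all) for the given duration.

    Returns the number of worker processes that were started.
    """
    workers = workers or multiprocessing.cpu_count()
    stop_time = time.time() + duration_seconds
    procs = [multiprocessing.Process(target=burn_cpu, args=(stop_time,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return len(procs)

# Example: consume_cpu(duration_seconds=75) holds the load long enough
# to cross a rule such as "80% CPU across all nodes for more than one
# minute", after which you can watch the group scale up and, once the
# experiment ends, scale back down.
```

In practice you would run this (or a purpose-built tool) on each node in the group while watching your monitoring dashboards for the three conditions listed above.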

Be Smart About Redundancy

Having redundant systems is essential for maintaining service during a failure. Organizations that don’t have redundancy risk losing as much as $220,000 for every minute of downtime. A common strategy is to create a replica of your environment and run it in a separate location (known as active-active redundancy). This has a better chance of protecting you during a major outage, but it’s also extremely expensive. Not only are you doubling your operating costs, but you have the added costs of transferring data between both environments.

Alternatively, you can create a replica of your environment that remains on standby and only operates when the primary fails (known as active-passive redundancy). This has the advantage of being lower cost, but it may take longer to spin up during a failover. In this case, we need a way to test our failover strategy to make sure that the replica automatically kicks in and handles load without downtime.

For example, let’s say we have two virtual machine instance groups placed behind a load balancer. One instance group is our primary group, while the second is our failover group. With Chaos Engineering, we can drop all network traffic between the load balancer and the instances in our primary group, to simulate a regional or zonal outage. We can then monitor traffic flow and application availability to make sure that:

  1. The load balancer detects the primary outage and redirects traffic to the secondary group.
  2. The secondary instance group can start up and serve traffic with minimal delays.
  3. Users don’t experience significant delays or data loss.

If we fail to meet any of these conditions, we can halt the attack and immediately return the flow of traffic to the primary group while we troubleshoot the problem. Approaching redundancy this way is effective for making sure that your redundant systems are working correctly and that you’re protected in case of an outage.
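The routing behavior we're testing can be sketched as a toy model. The classes below are hypothetical simplifications, not a real load balancer: health checks are reduced to a single flag that fails while a group is blackholed, which is enough to show the failover and recovery conditions from the list above.

```python
class InstanceGroup:
    """A group of instances; blackholed=True simulates dropped traffic."""
    def __init__(self, name):
        self.name = name
        self.blackholed = False

    def healthy(self):
        # A blackholed group fails its health checks.
        return not self.blackholed

class LoadBalancer:
    """Routes to the primary group, failing over to the secondary."""
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def route(self):
        if self.primary.healthy():
            return self.primary.name
        if self.secondary.healthy():
            return self.secondary.name
        raise RuntimeError("no healthy instance group: total outage")

primary = InstanceGroup("primary")
secondary = InstanceGroup("secondary")
lb = LoadBalancer(primary, secondary)

assert lb.route() == "primary"    # steady state
primary.blackholed = True         # simulate the zonal outage
assert lb.route() == "secondary"  # condition 1: traffic is redirected
primary.blackholed = False        # halt the attack
assert lb.route() == "primary"    # traffic returns to the primary group
```

The real experiment works the same way: drop traffic, confirm the failover conditions against your monitoring data, then halt the attack and confirm traffic returns to the primary.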

Find Unused Resources

It’s easy for cloud resources to become abandoned over time, for any number of reasons:

  • Teams create temporary test or demo environments that they forget to decommission.
  • Misconfigured autoscaling rules create new resources, but don’t remove unused resources.
  • Applications change and no longer use old systems, but engineers keep those systems running because they’re not sure if they’re still in use.
  • Engineers leave the company and forget to document older systems.

The challenge of removing abandoned resources is not knowing whether those resources are still being used. What if that compute instance that’s been running for three years is actually hosting a critical service? Even if the service isn’t critical, will destroying it cause some other, unexpected problem in our application?

Fortunately, we can use Chaos Engineering to test whether a service is essential without deleting or shutting down the instance. As with redundancy, we can drop network traffic to the host to simulate a host failure, then observe the impact on our application. If we're worried that this is an important production server, we can lower the magnitude of the attack by adding latency to network calls instead. If we notice that adding a reasonably small amount of latency (e.g. 150ms) noticeably reduces throughput, then we'll know this is a critical server. If not, we can escalate to a full blackhole attack. In any case, we can always halt the experiment and return service to normal before we do additional testing.
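The latency-versus-throughput reasoning above can be sketched in a few lines. This is a self-contained illustration, not a real network experiment: `fast_service` is a stand-in for a call to the suspect host, and `with_latency` plays the role of the injected delay.

```python
import time

def measure_throughput(call, duration_seconds=0.5):
    """Count how many calls complete within the given window."""
    completed = 0
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        call()
        completed += 1
    return completed

def with_latency(call, delay_seconds):
    """Wrap a call with injected, network-style latency."""
    def delayed():
        time.sleep(delay_seconds)
        return call()
    return delayed

def fast_service():
    return "ok"  # stand-in for a request to the suspect host

baseline = measure_throughput(fast_service)
degraded = measure_throughput(with_latency(fast_service, 0.150))

# If a modest 150ms of added latency collapses throughput, the host
# sits on a critical path; if throughput barely moves, the host is a
# candidate for decommissioning.
print(f"baseline: {baseline} calls, with 150ms latency: {degraded} calls")
```

In a real environment you would inject the latency at the network layer and compare your application's measured throughput before and during the attack.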


Reducing cloud spend is an ongoing challenge for SRE teams, especially as cloud platforms roll out new services and features. Chaos Engineering can help reduce your costs by helping you right-size your infrastructure, be more intelligent about redundancy, and uncover unused resources — all while helping you keep your applications running reliably.
