Use Monitoring Insights to Optimize Cost

One of the side effects of the “gold rush” to the public cloud is that many organizations are now facing rapidly growing cloud costs, including large and sudden spikes. About 55% of respondents to Anodot’s survey say they have been “surprised” by cloud costs or had an incident where cloud costs suddenly spiked. This problem gets worse as software companies scale. Many hyperscale companies have surpassed the point at which 50% of their total COGS are allocated to cloud spend, making cloud infrastructure cost optimization a strategic imperative as I noted in my previous post, Reducing Cloud Spend Need Not Be a Paradox.
In this post, I’ll compare two methods for controlling our cloud cost: cost monitoring vs cost efficiency. I’ll also use a specific example to illustrate how we can gain 10x better efficiency and get closer to an optimum level of efficiency by creating separate, optimized environments for development and production.
Some of this cost escalation is the result of human error. Consider the following real-world examples:
“One employee selected the wrong EC2 instance, and it cost the company nearly $40,000 over the course of a couple of days before the error was caught and corrected.”
“Internal users have left cloud-based GPUs spinning, even after work on them has stopped.” — Danny Zalkind, Kenshoo’s DevOps group manager
Methods for Achieving Cost-Efficiency
Cost-efficiency is all about matching the right infrastructure to the job.
The cloud offers a wide range of infrastructure resources at varying costs. For example, on EC2 alone AWS currently offers nearly 400 different instance types, with choices across storage options, networking, and operating systems. To complicate matters further, users can choose from machines located in 24 regions and 77 availability zones around the world.
This is just a small fraction of the options you can choose from to optimize your infrastructure, and this list keeps on growing. For the sake of simplicity, I grouped the primary optimization options into three main categories.
- Policy refers to usage patterns. For example, a de-commissioning policy sets a time limit on a workload to avoid running unused, idle resources. Autoscaling is another policy-driven approach, used to match infrastructure capacity with real-time demand. A placement policy can be used to define at runtime the right infrastructure target for a particular workload based on availability, location, etc. Repatriation is a policy in which we use a hybrid cloud to offload some of the workloads onto a dedicated and highly optimized cloud infrastructure purpose-built to run that specific workload. Dropbox storage and the Netflix CDN are examples of such workloads.
- HW Profile refers to the choice of the specific compute or storage resource combination that provides the best cost/performance ratio. This category alone includes thousands of possible combinations, ranging from Spot instances to a dedicated bare-metal machine.
- Architecture refers to the selection of a specific platform architecture such as EKS, ECS, Servlets, etc. Quite often the choice of a platform requires that the application be written specifically for that platform to achieve the best cost/performance ratio.
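To make the policy category more concrete, here is a minimal sketch of a de-commissioning policy: each workload carries a time limit, and anything that has outlived it is flagged for teardown. The `Workload` record, its fields, and the sample fleet are hypothetical and used only for illustration; a real implementation would read this data from the cloud provider's inventory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Workload:
    """Hypothetical record of a running workload (not a real cloud API object)."""
    name: str
    started_at: datetime
    ttl: timedelta  # time limit set by the de-commissioning policy


def expired_workloads(workloads, now):
    """Return the workloads whose time limit has passed."""
    return [w for w in workloads if now - w.started_at >= w.ttl]


# Illustrative fleet: one workload past its limit, one still within it.
now = datetime(2021, 6, 1, 12, 0)
fleet = [
    Workload("gpu-training", now - timedelta(hours=30), ttl=timedelta(hours=24)),
    Workload("dev-sandbox", now - timedelta(hours=2), ttl=timedelta(hours=8)),
]

to_remove = [w.name for w in expired_workloads(fleet, now)]
print(to_remove)  # ['gpu-training']
```

Running a check like this on a schedule is one simple way to catch cases such as GPUs left spinning after the work on them has stopped.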
Cost Monitoring vs. Cost Efficiency
Cost monitoring tells you where your infrastructure costs are being spent, and it may also highlight areas of potential inefficiency. However, it provides very little direction on how to fix those inefficiencies. As with any other monitoring system, cost monitoring can quickly overwhelm you with data, making it hard to separate the crucial insights from the noise.
Cost efficiency, on the other hand, continuously looks at how to match specific workloads to the right choice of architecture and infrastructure. Quite often that optimization will involve code or architecture changes, and therefore it requires more dedicated and continuous engineering work.
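To make the monitoring side of this contrast concrete, the sketch below flags days on which spend jumps well above its recent average. It is a deliberately naive rule, not how any particular cost monitoring product works; the function name, the seven-day window, the 2x threshold, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import mean


def find_cost_spikes(daily_costs, window=7, threshold=2.0):
    """Return indexes of days whose cost exceeds `threshold` times the
    average of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            spikes.append(i)
    return spikes


# Illustrative numbers only, not measured data.
costs = [100, 102, 98, 101, 99, 100, 103, 350, 104]
print(find_cost_spikes(costs))  # [7]
```

Note what the output does and does not tell us: it points at the day the spike happened, but says nothing about which workload should move to a different instance type or platform. That second question is the cost-efficiency work.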
The following example is a good demonstration of that difference. In this example, we use the same containerized workload and run it on two different platforms on the same cloud provider, EKS and ECS. As can be seen in this benchmark, by choosing ECS over EKS we can save 67%. In this specific case, this optimization comes at the expense of portability.
The lesson from this very simple example is that cost-efficiency is an ongoing engineering task: we constantly have to weigh conflicting tradeoffs, some of which have long-term implications, and these cannot be resolved just by throwing a tool at the problem.

In this example, ECS saves 67% over EKS, but at the cost of limiting workload portability.
Achieving the Optimum — the Right Infrastructure for the Job
The theoretical optimum in terms of efficiency is to tailor the infrastructure specifically for each particular workload. This is practically impossible but, nevertheless, it gives us a higher benchmark that we can strive to achieve.
As the number of infrastructure choices and platforms continues to grow, it becomes harder to handle the matchmaking exercise between the workload and the infrastructure at a granular infrastructure resource level.
While the number of infrastructure choices is extremely high, the number of types of workload environments is relatively small. This is especially so if we’re looking at the workloads that consume the bulk of our infrastructure resources.
Environment-as-a-Service (EaaS) provides a means by which we create an optimized stack for each workload environment. Examples of such environment types include:
- Development and production environments
- Machine learning environments
- Environment per project/customer/product
We refer to those environments as “certified environments.”
Example — Optimizing Development vs. Production Environments
In the following example, we took a typical Kubernetes-based environment, which includes the Kubernetes cluster as well as shared infrastructure services such as network, storage, and database.
We created multiple versions of that same stack. The first is optimized for production in AWS and Azure. In this case, we chose a fully managed stack: managed Kubernetes (EKS, AKS), managed storage, and a managed database. For the development environment, we chose a stack optimized for low cost and agility: Minikube or K3s as the lightweight Kubernetes, single-instance storage (Minio), Postgres, and a simple network, all running on a single VM.
The following diagram shows the specific mapping of the different flavors of that environment.
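The same mapping can also be written down as data. The sketch below is only an illustration of the idea of a small catalog of certified environments; the dictionary keys, the environment names, and the `stack_for` helper are hypothetical, while the component choices are the ones described above.

```python
# Stack components per environment, following the description in the text.
# Keys and environment names are hypothetical labels chosen for this sketch.
CERTIFIED_ENVIRONMENTS = {
    "production-aws": {
        "kubernetes": "EKS",
        "storage": "managed storage",
        "database": "managed database",
    },
    "production-azure": {
        "kubernetes": "AKS",
        "storage": "managed storage",
        "database": "managed database",
    },
    "development": {
        "kubernetes": "Minikube or K3s",
        "storage": "Minio (single instance)",
        "database": "Postgres",
        "network": "simple network",
        "host": "single VM",
    },
}


def stack_for(environment):
    """Look up the stack for a named environment, rejecting unknown names."""
    try:
        return CERTIFIED_ENVIRONMENTS[environment]
    except KeyError:
        known = ", ".join(sorted(CERTIFIED_ENVIRONMENTS))
        raise ValueError(f"unknown environment {environment!r}; expected one of: {known}") from None


print(stack_for("development")["kubernetes"])  # Minikube or K3s
```

Keeping the catalog this small is the point: a developer picks an environment by name instead of choosing among hundreds of individual infrastructure resources.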
Achieving 10X Cost Saving
We used real-time cost monitoring to measure the actual cost per environment. As we expected, the development environment cost was equal to the cost of a single VM and was 10x lower than that of the plain vanilla production stack.
Interestingly, the detailed resource breakdown also showed that creating a managed resource brings with it a significantly higher number of resources and associated hidden costs, as we can see in the following resource breakdown table:
You should note that in this case, we didn’t include things like geo-redundancy, which would double the number of production resources as well as increase the bandwidth and networking cost. The dynamic nature of those production resources also makes the actual cost close to impossible to predict, whereas in the development stack all the resources live in a single VM, which makes cost prediction fairly deterministic.
Final Notes
Cost-efficiency is an ongoing engineering task. Moving our workloads to the cloud, choosing a cloud-native stack and automation tools, and using spot instances where possible are a good start, but on their own they still leave us far from the optimum.
A cost monitoring tool can help us detect anomalies and show us where our infrastructure spend goes, but it doesn’t replace the ongoing engineering work needed to optimize our stacks.
With the number of infrastructure choices and platforms continuously growing, it’s going to be close to impossible to optimize stacks at the granular infrastructure resource level. EaaS simplifies this engineering work by taking a more coarse-grained approach in which we organize our environments into a small number of highly optimized stacks based on their target usage. One common example is separating the production and development environments, as we demonstrated in the example above.
With this approach, we can get much closer to the theoretical optimum.
It’s Not Just About Cost
The move to Environment-as-a-Service and certified environments brings with it additional benefits other than cost-efficiency (such as better agility) by democratizing our development environment. Stay tuned for more in this regard.