Reducing Cloud Spend Need Not Be a Paradox

Few people have Martin Casado’s perspective on cloud transformation. Recently, he and Sarah Wang published a much-discussed analysis titled “The Cost of Cloud, a Trillion Dollar Paradox” on the blog of venture capital firm Andreessen Horowitz (a16z). The post challenges conventional thinking about the mania surrounding cloud transformation and cloud migration, and it asks us to think about the cost of cloud infrastructure and the effect that cost, in turn, has on company valuation.
A recent blog from @a16z made the case that for SaaS companies, the cost of cloud is a drag on their market caps. What @martin_casado told me about the post and the debate that ensued: https://t.co/1mqSLUMSYT
— Belle Lin (@bellelin_) June 23, 2021

The paradox states that cloud is cheaper earlier in a company’s evolution, and it becomes more costly as the company scales. They state it simply: “You’re crazy if you don’t start in the cloud, then you’re crazy if you stay on it.” The paradox is that cloud infrastructure makes your business model possible at smaller scales, but it transforms into a source of value destruction at scale, apparent only after you’re deeply committed to the cloud. In aggregate, this equates to hundreds of billions of dollars of equity value evaporation.
Casado and Wang emphasize repatriation — bringing workloads back to private or hybrid infrastructure from cloud-only models — as the main strategy to optimize infrastructure cost. They tell the story of a billion-dollar private software company with a public cloud spend consuming 81% of the company’s cost of revenue (COR). Among the largest 50 publicly traded software companies, aggregate cloud bills top $8 billion (among those that reveal cloud spend).
It’s curious that cloud repatriation is not more popular. Repatriation can drive a big reduction in cloud spend; an oft-cited figure is 50% savings. Adopting that figure for the sake of argument, repatriation would turn the $8 billion of aggregate cloud spend cited by Casado and Wang into $4 billion of recovered profit. Consider the broad universe of at-scale software companies using public cloud infrastructure, and you can quickly see that the unrealized profit could be far higher than $4 billion.
The a16z post offers a set of useful recommendations on how to overcome the trillion-dollar paradox, including making cloud spend a KPI, incentivizing engineers to optimize resource consumption, choosing a subset of your most resource-intensive workloads as a place to start, and thinking about repatriation upfront before inertia and lock-in strip away your options for repatriation.
My Take
Infrastructure cost does not always grow in direct proportion to revenue. This can lead to shrinking profitability as a company scales, and as a result, the growing cost of cloud infrastructure equates to hundreds of billions of dollars of lost equity value among software companies as they reach scale.
Let’s dig in on this a bit: savings on cloud spend can have a 25x impact on market cap, according to Casado and Wang’s analysis. Applying this multiple, an additional $4 billion of gross profit can be estimated to yield an additional $100 billion of market capitalization among these 50 companies alone.
Monitoring service provider Datadog, a publicly traded company, recently traded at close to 40 times 2021 estimated gross profit and disclosed an aggregate $225 million, three-year commitment to Amazon Web Services in its S-1.
Let’s annualize the committed spend to $75 million of annual AWS costs, and let’s further assume that 50%, or $37.5 million, of this may be recovered via cloud repatriation. At a multiple of 40, this translates to approximately $1.5 billion of additional market cap for the company, on committed spend reductions alone! If we expand to the broader universe of enterprise software and consumer internet companies, this number is likely more than $500 billion, assuming 50% of overall cloud spend is consumed by at-scale technology companies that stand to benefit from cloud repatriation.
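The arithmetic behind both estimates can be written down as a short sketch. This is only a back-of-the-envelope model using the figures quoted above (the 50% savings rate and the 25x and 40x gross profit multiples); the function name `market_cap_uplift` is introduced here purely for illustration.

```python
def market_cap_uplift(annual_cloud_spend: float, savings_rate: float, multiple: float) -> float:
    """Estimate the market cap gained when part of cloud spend becomes gross profit.

    Assumes every dollar saved flows straight into gross profit and that
    the market values that profit at a fixed multiple.
    """
    recovered_gross_profit = annual_cloud_spend * savings_rate
    return recovered_gross_profit * multiple


# Aggregate case: about $8 billion of cloud spend, 50% savings, 25x multiple.
aggregate_uplift = market_cap_uplift(8e9, 0.50, 25)

# Datadog case: $225 million over three years, annualized, 50% savings, 40x multiple.
datadog_annual_spend = 225e6 / 3
datadog_uplift = market_cap_uplift(datadog_annual_spend, 0.50, 40)

print(f"Aggregate uplift: ${aggregate_uplift / 1e9:.0f} billion")
print(f"Datadog uplift: ${datadog_uplift / 1e9:.1f} billion")
```

The model is deliberately crude: it ignores the cost of running repatriated infrastructure and treats the multiple as constant, so the results are best read as rough upper bounds.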
Consider this example of the impact efficiency might have on company valuation. MongoDB and Elastic reported nearly identical fiscal year 2021 annual revenue ($590 million and $608 million, respectively). Why, then, is the market cap of MongoDB nearly double that of Elastic ($23.4 billion and $13.6 billion, respectively)? One clue might be the difference in infrastructure efficiency: MongoDB uses fine-grained multi-tenancy for its SaaS offering, whereas Elastic uses a separate cluster per tenant. The difference in resource consumption is dramatic.
In preparing this post, I chatted with Casado, who pointed out to me that the likely difference here is the power of a SaaS model. “The market values cloud revenue about three times more than open source on-premises, and the reason is largely about net revenue retention,” he observed. On-premises open source infrastructure tends to have a high churn rate (normally 18%). In addition to Elastic and MongoDB, Confluent and others have felt this effect as well. Elastic is one company that hasn’t been terribly successful in moving large portions of its offering to SaaS. MongoDB and Confluent accomplished this, and Databricks and Snowflake started in the cloud.
Casado also called out the example of Atlassian, which moved its service to a multitenant cloud model in AWS and in so doing reduced its cost by a factor of three. However, this wasn’t because AWS was cheaper. It wasn’t. It was because the re-architected, multitenant model made the service far more lightweight.
Focus on Services, Not Infrastructure
The job to be done here is to optimize cloud spend. Therefore, we need to think of cloud optimization in pragmatic terms. Optimization is hard. To be successful, we need to stop framing the problem as velocity of feature development versus efficiency. Instead, we should treat efficiency as a first-class feature that is prioritized in our backlog and gets the same management attention as any other feature.
Throwing Automation at the Problem Can Make Things Worse
Using automation as an efficiency tool to reduce cost sounds trivial. If so, why are so many cloud implementations still so horribly inefficient, often by orders of magnitude? Automation done right can reduce cost significantly. Quite often, however, the side effect of automation is that it makes it easy for developers to spin up cloud resources and leave them running even when they’re not needed. In “Is There an Enterprise Margin Crisis?”, Casado points out:
- Automation isn’t magic: Many companies try to improve margins by automating human processes. This can be technically challenging, and the drive for growth makes prioritization difficult.
- Unoptimized cloud: Private markets push for growth, and so cloud implementations can be inefficient by orders of magnitude. Waiting for when growth slows to correct this — when margins are more important — is rarely trivial.
I’ve run into many companies that moved their monolithic application into Kubernetes, and during that phase they experienced increased efficiency. Fairly quickly, however, the cost of their cloud infrastructure started skyrocketing. Developers started spinning up instances not necessarily for the right reasons: they did so simply because it was significantly easier.
Take Ownership of Your Workload
In automation, there tends to be too much emphasis on infrastructure automation and almost no focus on the automation of the service itself. In my experience, there is more room for optimization at the service layer than at the infrastructure layer.
Decouple the Workload from the Infrastructure Choice
To achieve optimization at the service layer, we need to decouple the service from the choice of infrastructure. This gives us more flexibility in choosing the right infrastructure or cloud for the job, and it leaves room for incremental optimization as we grow.
Kubernetes, Terraform, and Ansible Are Not Enough
Kubernetes, Terraform, and Ansible are great tools. They help abstract away and simplify infrastructure management. But on their own, they are insufficient:
- Managing infrastructure and managing the services atop that infrastructure are two different things. This is especially true when you consider day 2 operations such as continuous updates.
- Managing distributed services across multiple Kubernetes clusters, multiple data centers, or multiple clouds is still fairly complex, and these tools offer limited help.
- It’s easy to get lost when you have lots of templates and scripts to manage your infrastructure but nothing that maps all of this back to your service.
Regaining control of our services: moving up the stack beyond IaC and Kubernetes
I argue that the biggest potential for overcoming many of these issues and regaining control over our own applications lies in moving up the stack: thinking about how we manage our services and not just the infrastructure that runs those services. By decoupling the service from the infrastructure, we can create predefined zones of infrastructure that are highly optimized to serve each workload (test, production, ML, networking, etc.). These optimized zones don’t have to live outside of the cloud, as there remains ample room for optimization even within the same cloud, and obviously between clouds. In that context, moving off the cloud becomes just another special case of those optimized infrastructure zones. I refer to this as Environment-as-a-Service (EaaS).
To see how these ideas map to a real-world case, consider running the same workload on two different infrastructure stacks: one optimized for production and the other for development. The same idea can be applied to other areas.
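A minimal sketch of this idea in Python follows. It is not taken from any particular tool. The names (`Service`, `Zone`, `ZONES`, `plan_deployment`) and every zone setting are hypothetical, chosen only to show one service definition being matched to two differently optimized zones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Service:
    """What the workload needs, with no reference to where it runs."""
    name: str
    image: str
    cpu_cores: float
    memory_gb: float


@dataclass(frozen=True)
class Zone:
    """A predefined infrastructure zone tuned for one kind of use."""
    name: str
    replicas: int
    use_spot_instances: bool
    idle_shutdown_hours: Optional[int]  # None means the zone is always on


# Hypothetical zone settings, chosen only to show the contrast.
ZONES = {
    "production": Zone("production", replicas=3,
                       use_spot_instances=False, idle_shutdown_hours=None),
    "development": Zone("development", replicas=1,
                        use_spot_instances=True, idle_shutdown_hours=8),
}


def plan_deployment(service: Service, zone_name: str) -> dict:
    """Combine one service definition with the chosen zone's settings."""
    zone = ZONES[zone_name]
    return {
        "service": service.name,
        "image": service.image,
        "zone": zone.name,
        "replicas": zone.replicas,
        "total_cpu_cores": service.cpu_cores * zone.replicas,
        "total_memory_gb": service.memory_gb * zone.replicas,
        "use_spot_instances": zone.use_spot_instances,
        "idle_shutdown_hours": zone.idle_shutdown_hours,
    }


# The same service definition is planned against both zones.
checkout = Service(name="checkout", image="checkout:1.4", cpu_cores=2.0, memory_gb=4.0)
prod_plan = plan_deployment(checkout, "production")
dev_plan = plan_deployment(checkout, "development")
```

The service definition never mentions a provider or an instance type. All cost-related decisions (replica count, spot capacity, idle shutdown) live in the zone, so a zone can be re-tuned, or a repatriated on-premises zone added, without touching the service.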
Frustrated? You are not alone. But there’s hope.
The trillion-dollar paradox need not be a value-destroying trap for successful software companies. By focusing further up the stack and matching services to the right infrastructure choices, incentivizing optimizing behaviors, automating thoughtfully (not reflexively), and having a repatriation strategy before you reach scale, you can be better positioned to rein in costs and retain value for your shareholders.
The cycle of cloud migration, automation, and cost optimization is an ongoing process that requires continuous iteration, overcoming and learning from failure, and above all, teamwork. There are many tools that can help you achieve this goal, but at the end of the day, without the right discipline and partners, they can turn against you. As historian Yuval Noah Harari remarked, “A knife can be used to cut vegetables and make great food but it can also be used to kill people: it all depends on how you use it.”
As a start, we need to reset our expectations to solve the paradox. There are options today that allow you to simplify your journey, as noted above. We must start thinking further up the stack and further up the value chain, focusing on the service itself rather than the infrastructure, and matching the right infrastructure to the service and not the other way around.