
Zombie Resources Eat up Your Cloud Budget

Relying on manual checks for proper infrastructure allocation and utilization adds to the burden of wasted resources and expenses for companies large and small.
Oct 20th, 2023 7:01am by

No organization, regardless of its size, can afford to maintain unused and redundant infrastructure. Often referred to as “zombie” or “orphan” infrastructure, these resources amount to paying for nothing, a cost that can pose an existential threat to companies, particularly young startups that often operate on tight runways.

Depending on the site reliability engineering (SRE) team and other key stakeholders to conduct manual housekeeping checks for proper infrastructure allocation and utilization only adds to the burden of wasted resources and expenses for companies large and small.

Even worse are the security risks: The more infrastructure you have, the more you have to secure, and leaving expensive orphaned resources unused presents opportunities for bad actors. In the case of young companies, these resources can often contain sensitive data that was used during development but wasn’t cleaned up afterward.

All Sizes Concerned

Source: Flexera 2023 State of ITAM Report

The costs of cloud infrastructure lingering unused are staggeringly high on an industry-wide basis, according to data from IT management software company Flexera. Cloud infrastructure bills can easily run into the millions of dollars, and only about two-thirds of that spending is used efficiently. Small and medium businesses alone spend $200,000 to $400,000 annually on cloud waste.

For young startups, unused infrastructure can be a death knell. A startup with a funding round of about $12 million (the 2023 median for Series A) could lose up to 3% of its runway annually due to zombie infrastructure and cloud waste, according to Crunchbase. While this might not seem like much, in an era when VC funding is relatively scarce and companies have longer intervals between successive rounds, a 3% annual loss to waste could easily become catastrophic.
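To make the scale concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the $200,000 to $400,000 annual waste range quoted earlier applies to a startup that has raised the $12 million median round, and that the round size stands in for the runway; the article does not say how the 3% figure was derived, and the function name is hypothetical.

```python
def waste_share_of_round(annual_waste_usd: float, round_size_usd: float) -> float:
    """Return annual cloud waste as a fraction of the funding round."""
    if round_size_usd <= 0:
        raise ValueError("round_size_usd must be positive")
    return annual_waste_usd / round_size_usd


# Figures quoted in the article; pairing them is an assumption of this sketch.
SERIES_A_USD = 12_000_000
WASTE_LOW_USD = 200_000
WASTE_HIGH_USD = 400_000

low = waste_share_of_round(WASTE_LOW_USD, SERIES_A_USD)
high = waste_share_of_round(WASTE_HIGH_USD, SERIES_A_USD)

print(f"{low:.1%} to {high:.1%} of the round per year")
# prints "1.7% to 3.3% of the round per year"
```

Under these assumptions, the upper end of the range lands close to the 3% cited above, and it recurs every year the waste goes unaddressed.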

It might be tempting to assume that as a startup matures, its tech will naturally mature with it. But high-growth startups are especially vulnerable to the zombie infrastructure problem regardless of which round of funding they are in.

As the company creates new products or innovates on current ones, its developers provision new infrastructure for experimentation. Companies in this phase tend to operate lean and scrappy, without comprehensive documentation or an assessment of how those resources will be used and when they will be released. Flexera also found that 80% of small and medium businesses reported managing cloud costs as their biggest challenge.

Large enterprises are not immune: enterprise organizations have reported struggling with cloud expenditure at the same rate. Though it might not pose as much of an existential risk to mature companies, it can seriously handicap their ability to innovate. Large amounts of money lost to zombie infrastructure and waste can reduce a company’s ability to attract top talent and invest in new technology, and can hamstring strategic initiatives like an IPO or an acquisition.

The “hidden” costs of not cleaning up these resources can be as bad as the upfront bill. Zombie cloud infrastructure widens the “surface area” available for bad actors to exploit: because these resources are not regularly used, they tend to receive less monitoring than business-critical resources and are therefore more vulnerable. The average cost of a cloud breach in 2023 was $4.45 million, according to an IBM report, and companies in highly sensitive spaces like financial services, insurance tech and health care may also lose business due to a loss of public trust.

Automate, or Else

So how do these zombies happen in real life? Manual configuration is a key cause, such as a human forgetting to remove block storage attached to other resources. The complexity of cloud platforms and the expertise needed to manage them properly magnify this risk, according to data from observability company Virtana. The data identified over-reliance on platform engineers to manually steward infrastructure as a principal driver of rising cloud costs.
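The forgotten block storage case can be sketched as a simple inventory check. This is a minimal illustration over in-memory records, not a call to any cloud provider's API; the `Volume` type, the IDs and the costs are all made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Volume:
    volume_id: str
    attached_to: Optional[str]  # instance ID, or None when detached
    monthly_cost_usd: float


def find_orphaned_volumes(volumes: Iterable[Volume],
                          live_instance_ids: Iterable[str]) -> List[Volume]:
    """Return volumes that are detached or attached to an instance that no longer exists."""
    live = set(live_instance_ids)
    return [v for v in volumes
            if v.attached_to is None or v.attached_to not in live]


# Hypothetical inventory: i-9 was terminated, but its volume was left behind.
inventory = [
    Volume("vol-a", "i-1", 8.0),
    Volume("vol-b", None, 40.0),
    Volume("vol-c", "i-9", 12.5),
]

orphans = find_orphaned_volumes(inventory, live_instance_ids=["i-1", "i-2"])
print([v.volume_id for v in orphans])             # ['vol-b', 'vol-c']
print(sum(v.monthly_cost_usd for v in orphans))   # 52.5
```

In practice the inventory would come from the provider's own listing APIs, and a check like this would run on a schedule rather than depend on someone remembering to do it.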

Investing in proper workflow automation, by contrast, can prevent zombie infrastructure from the beginning of the software development life cycle. Rather than burdening SRE teams with manual checks, these automations reduce their workload by managing the setup and teardown of resources on the fly. For developer cloud resources in testing and staging environments that “go rogue” or remain unused, removal can be automated. Because the automation keeps track of where and how cloud resources are deployed and used, those resources can also be taken down on demand. This approach lends itself to a self-service model: developers can provision resources instantly and, just as importantly, remove them when they are no longer needed.
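One common way to automate removal of idle testing and staging resources is an idle-time policy. The sketch below assumes each resource record carries an environment label and a last-used timestamp; the `EnvResource` type, the seven-day threshold and the resource names are hypothetical choices for illustration, not something the article prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class EnvResource:
    name: str
    environment: str      # e.g. "testing", "staging", "production"
    last_used: datetime


def select_for_teardown(resources: Iterable[EnvResource],
                        now: datetime,
                        max_idle: timedelta = timedelta(days=7),
                        environments: Tuple[str, ...] = ("testing", "staging"),
                        ) -> List[EnvResource]:
    """Return non-production resources that have been idle longer than max_idle."""
    return [r for r in resources
            if r.environment in environments and now - r.last_used > max_idle]


now = datetime(2023, 10, 20, tzinfo=timezone.utc)
fleet = [
    EnvResource("pr-142-db", "testing", datetime(2023, 10, 1, tzinfo=timezone.utc)),
    EnvResource("staging-cache", "staging", datetime(2023, 10, 18, tzinfo=timezone.utc)),
    EnvResource("prod-db", "production", datetime(2023, 9, 1, tzinfo=timezone.utc)),
]

print([r.name for r in select_for_teardown(fleet, now)])  # ['pr-142-db']
```

Production resources are excluded by default here, so an idle but business-critical database is never selected by this policy.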

Developer-centric workflow automation tools from companies like Nitric take this one step further by integrating directly with product code and acting as a “self-service” platform for product engineers. Nitric works by wrapping around your cloud providers’ SDKs and your application framework. It supports multiple application runtimes, including the Node.js runtime and the Python interpreter, with experimental support for C#, Go and Java. Because it supports multiple cloud providers (currently Azure, AWS and Google Cloud Platform), its syntax is provider-agnostic. Developers can even use the same application and run the same provisioning logic against multiple cloud providers at once. Teams can use the Nitric API to define infrastructure as they write code. When the code is run, Nitric’s platform hooks into the specified cloud provider’s API to provision the resources on the fly.
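The general idea of declaring infrastructure in application code and resolving it per provider can be sketched in a few lines. To be clear, this is not Nitric's SDK or API: the `App` registry, the `plan` function and the service mapping are hypothetical stand-ins that only illustrate the provider-agnostic pattern described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative mapping only; a real tool would call each provider's API.
PROVIDER_SERVICES: Dict[str, Dict[str, str]] = {
    "aws": {"bucket": "S3 bucket", "topic": "SNS topic"},
    "gcp": {"bucket": "Cloud Storage bucket", "topic": "Pub/Sub topic"},
    "azure": {"bucket": "Blob Storage container", "topic": "Event Grid topic"},
}


@dataclass
class App:
    """Hypothetical registry that records resources as application code declares them."""
    declared: List[Tuple[str, str]] = field(default_factory=list)

    def bucket(self, name: str) -> str:
        self.declared.append(("bucket", name))
        return name

    def topic(self, name: str) -> str:
        self.declared.append(("topic", name))
        return name


def plan(app: App, provider: str) -> List[str]:
    """Translate the provider-agnostic declarations into one provider's services."""
    services = PROVIDER_SERVICES[provider]
    return [f"{services[kind]} '{name}'" for kind, name in app.declared]


# Application code declares what it needs, with no provider named.
app = App()
uploads = app.bucket("uploads")
events = app.topic("order-events")

# The same declarations can then be planned against more than one provider.
for provider in ("aws", "gcp"):
    print(provider, plan(app, provider))
```

Because the declarations live next to the code that uses them, the tool always has an up-to-date list of what the application actually needs.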

When it’s time to deploy, the Nitric deployment engine containerizes the application code, data stores and any other necessary runtimes, and deploys them to the cloud platforms you have defined. The Nitric SDK and Runtime Provider then continuously monitor those resources and how they are being used in the code. When Nitric detects that those resources are no longer being referenced, it can de-provision them to reduce the risk of them becoming zombie resources that drive up the organization’s cloud bills.
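De-provisioning resources that code no longer references amounts to comparing what is declared with what is deployed. The sketch below shows that comparison as a plain set difference; it is an assumption about the general technique, not a description of how Nitric implements it, and the resource names are invented.

```python
from typing import Dict, Iterable, List


def reconcile(declared: Iterable[str], deployed: Iterable[str]) -> Dict[str, List[str]]:
    """Compare resources referenced in code with resources that exist in the cloud."""
    declared_set, deployed_set = set(declared), set(deployed)
    return {
        "create": sorted(declared_set - deployed_set),  # referenced but missing
        "keep": sorted(declared_set & deployed_set),    # referenced and present
        "delete": sorted(deployed_set - declared_set),  # present but unreferenced
    }


# Hypothetical state: the code stopped using "legacy-exports" in a recent release.
referenced_in_code = ["uploads", "order-events", "thumbnails"]
currently_deployed = ["uploads", "order-events", "legacy-exports"]

actions = reconcile(referenced_in_code, currently_deployed)
print(actions["delete"])  # ['legacy-exports']
print(actions["create"])  # ['thumbnails']
```

A real system would likely add safeguards before deleting anything, such as a grace period or a backup of any data the resource still holds.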

Implementing an automated provisioning and de-provisioning workflow also saves your infrastructure team operating hours, and therefore money. The time spent manually cleaning up zombie cloud instances could be better spent advancing key SRE initiatives that improve the developer experience and the health of the business as a whole.

High Hopes

It might seem like a stretch to expect cloud provisioning automation to save companies millions of dollars, but the industry is already shifting to this model. Gartner predicts that by 2025, a majority of large enterprises will shift to infrastructure automation in an effort to reduce risk and increase cost savings.

Introducing infrastructure automations into an architecture can unlock significant time and financial savings for developers and engineering teams and put the organization on a path to avoid costly mistakes and oversized cloud bills.

Ultimately, whether organizations grow their product to its full potential will largely hinge on their ability to automate the monitoring and removal of so-called zombie resources. Automation will also help organizations properly assess the value of all their resources, which in turn helps ensure a path to growth.

TNS owner Insight Partners is an investor in: Pragma.