Navigating the Trade-Offs of Scaling Kubernetes Dev Environments

How should we balance cost and performance goals when scaling up cloud native environments?
Feb 16th, 2023 8:29am by
Welcome to the cloud native paradise, where development teams can grab the exact preconfigured Kubernetes clusters they need from a self-service engineering platform, and they are ready to scale across hybrid cloud infrastructures in a multitenant way wherever they are deployed.

Self-service provisioning and multitenancy eliminate useless toil and resource constraints from software delivery. Dev teams get the clusters they need, when they need them, without wasting valuable time, which also makes delivery cheaper.

Therefore, we can plan to impress the CFO with a better ratio of total value delivered to the business, versus total cost of ownership for the software. Problem solved!

But like everything else in software development, we must expect the unexpected. Developers may still find themselves mucking with infrastructure instead of coding, and surprises inevitably await when it comes time to pay the cloud bill for dev, test, staging and production environments.

How should we balance cost and performance goals when scaling up cloud native environments?

Cloud Costs: Hard to Measure, Hard to Evaluate

By my last count, dozens of vendors claim to reduce public cloud compute and storage fees by limiting consumption in various ways through the account interface.

That’s useful for a one-time cost improvement, but it fails to consider the complexity and hidden costs of meeting the requirements of a cloud native development team with environments sophisticated enough to meet their needs at scale. This leaves platform teams with several open questions:

  • How do you optimize the cost of entry for assembling that “golden path” cloud native architecture?
  • Is redundant labor required to configure Infrastructure as Code scripts and permissions for self-service environments if they don’t come out fully baked?
  • Beyond cloud fees, what are the support costs of maintaining so many dev and staging labs atop an ever-changing Kubernetes stack?

Obviously, a simple cost equation based on fees and licenses can only tell us half of the story about how to value the capacity for improving software delivery productivity, reducing toil and preventing talent churn.

How do organizations arrive at the real value of cost and performance metrics?

Taking a FinOps View of Challenges

FinOps has emerged as a discipline to address the see-saw problem of balancing the CFO’s budget constraints with the CIO’s technology delivery requirements. It does so by governing an enterprise’s technology spending decisions against the value of the business outcomes those investments generate.

The costly environmental pollution of old servers, containers and VM sprawl hounds mature organizations, where proof-of-concept experimental deployment and test environments are often forgotten or abandoned at the end of every project, leading to a lengthy house cleaning.

The introduction of cloud computing and software-as-a-service vendors allowed companies to replace big capital (or capex) outlays for data centers, hardware and enterprise silos with pay-as-you-go operational expenses (or opex) for resources that could elastically scale in capacity and cost.

This honeymoon didn’t last forever, as cloud environment costs started ballooning by year-over-year multiples in many cases. Companies started realizing that opex needs even more FinOps oversight than they once applied to capex purchases.

Developers naturally want cloud native environments on demand that are scaled to their exact needs. To avoid waiting for clusters to spin up, they build and provision multiple clusters to support each use case in Amazon Web Services (AWS), and then leave them running 24/7, each with its own EC2 control and worker nodes.

What a waste of electricity and cloud fees: 10 copies of identical infrastructure are left running 10 times longer than they need to be.

It’s not like AWS, Azure or GCP (Google Cloud Platform) want to sell their customers cloud capacity they aren’t going to productively use. But at the same time, a hyperscaler would also never suggest turning off any tenant’s reserved instances or clusters that developers might want to use down the road.

Rightsizing and Right Timing

A core principle of FinOps is rightsizing: paying for and provisioning just the right amount of capacity or resources to get the job done, and nothing more.

Loft Labs offers an interesting approach to rightsizing cloud native development environments with a multitenant Kubernetes platform that shares a control and management plane. This shared platform stack spins up ready “golden state” configurations — with underlying microservices like logging, monitoring and networking in seconds — and spins them down the instant they are no longer in use.

The core technology driving the platform is its open source vCluster technology, which allows multiple virtual clusters to run as momentary workloads within a single Kubernetes namespace while retaining developer work isolation and access controls on a per-vCluster basis.

Early cost-saving estimates of this approach are promising. Loft ran a scenario analysis of an enterprise with 300 single-tenant Kubernetes clusters running on Amazon Elastic Kubernetes Service (EKS), with an annual operation cost of $1,642,800. By using 300 virtual clusters on one shared Kubernetes cluster, that company would instead spend around $997,876 for the year — nearly 40% less. Developers would see no difference in their experience.

Figure 1. Estimated cost analysis of EKS clusters alone versus virtual clusters atop a single shared multitenant EKS cluster. Source: Loft Labs
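
To check the arithmetic behind the "nearly 40%" claim, here is a minimal Python sketch that uses only the two annual figures and the cluster count quoted from the Loft Labs scenario. The variable and function names are illustrative, and the per-cluster averages are a simple division for comparison, not a number reported by Loft Labs.

```python
# Figures quoted from the Loft Labs scenario described above (USD per year).
SINGLE_TENANT_ANNUAL_COST = 1_642_800   # 300 single-tenant EKS clusters
VIRTUAL_CLUSTER_ANNUAL_COST = 997_876   # 300 virtual clusters on one shared cluster
CLUSTER_COUNT = 300


def savings_fraction(baseline: float, optimized: float) -> float:
    """Return the share of the baseline cost that is saved (0.0 to 1.0)."""
    return (baseline - optimized) / baseline


annual_savings = SINGLE_TENANT_ANNUAL_COST - VIRTUAL_CLUSTER_ANNUAL_COST
fraction = savings_fraction(SINGLE_TENANT_ANNUAL_COST, VIRTUAL_CLUSTER_ANNUAL_COST)

# Simple averages for comparison only (an assumption: costs spread evenly).
per_cluster_before = SINGLE_TENANT_ANNUAL_COST / CLUSTER_COUNT
per_cluster_after = VIRTUAL_CLUSTER_ANNUAL_COST / CLUSTER_COUNT

print(f"Annual savings: ${annual_savings:,}")
print(f"Savings share:  {fraction:.1%}")
print(f"Average cost per cluster: ${per_cluster_before:,.2f} -> ${per_cluster_after:,.2f}")
```

The difference works out to $644,924 a year, a little over 39% of the single-tenant baseline, which is consistent with the "nearly 40%" figure.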

Additionally, a sleep mode allows vClusters to automatically suspend operations and take “naps” during off-peak usage times, or whenever they are idle, and then refresh in seconds. This takes care of resource usage during irregular project schedules and is estimated to save an additional 30% in cloud costs without affecting developer availability.
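
The article does not say what the additional 30% is measured against. As a rough sketch, the Python below assumes it applies to the already reduced virtual cluster cost of $997,876 and shows how the two savings would compound against the $1,642,800 single-tenant baseline. That interpretation, and all names in the code, are assumptions made for illustration.

```python
# Figures quoted from the Loft Labs scenario (USD per year).
SINGLE_TENANT_ANNUAL_COST = 1_642_800
VIRTUAL_CLUSTER_ANNUAL_COST = 997_876

# Estimated extra savings from sleep mode, as quoted in the article.
SLEEP_MODE_SAVINGS = 0.30


def apply_savings(cost: float, savings_rate: float) -> float:
    """Return the cost remaining after removing a fractional saving."""
    return cost * (1.0 - savings_rate)


# ASSUMPTION: the 30% is taken off the virtual cluster cost,
# not off the original single-tenant baseline.
cost_with_sleep = apply_savings(VIRTUAL_CLUSTER_ANNUAL_COST, SLEEP_MODE_SAVINGS)
combined_savings = 1.0 - cost_with_sleep / SINGLE_TENANT_ANNUAL_COST

print(f"Estimated annual cost with sleep mode: ${cost_with_sleep:,.2f}")
print(f"Combined savings versus baseline: {combined_savings:.1%}")
```

Under this assumption the savings multiply rather than add: the combined reduction lands somewhere above 57%, not 70%, because the 30% is applied to a smaller bill.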

The Intellyx Take

Of course, development platform teams could just create unique Kubernetes namespaces for each dev/test environment, and then each could chart their own clusters at will, which is fine if the configuration and cloud costs aren’t a concern for the organization. After all, it’s all free, open source tooling, right?

One of the coolest features of the cloud native development paradigm is that it purposely “leaves the wires hanging” rather than dictating one way to serve complex distributed applications and organizations.

Kubernetes leaves the door open for highly compact, virtual clusters that can share costly cloud resources while still serving a highly distributed multitenant development workforce with high-performance development environments that also save unnecessary labor costs and perform well in budget reviews.

TNS owner Insight Partners is an investor in: Pragma.