KubeCon Panel Offers Cloud Cost Cutting Advice
Back in the days of on-premise compute, reducing costs meant cutting capital expenditures. But with the cloud’s pay-as-you-go model, how can companies realize efficiencies in light of the current economic climate?
“It’s really becoming an … operational expense and impacting companies greatly,” said Aparna Subramanian, director of product engineering infrastructure at Shopify, during a Friday session at KubeCon+CloudNativeConEurope 2023 conference in Amsterdam. “That’s the reason why we have this increased focus on optimizing and doing more with less is the mantra these days.”
Subramanian joined Phillip Wittrock, an Apple software engineer; Nagu Chinnakaveti Thulasiraman, engineering manager in the car infrastructure department at Zalando SE; and Todd Ekenstam, principal software engineer at Intuit for Friday session on, “Cloud Computing’s First Economic Recession? Let’s Talk Platform Efficiency.” The panel looked at three broad categories of reducing costs: Culture, operations and design.
Culture: Measure at App and Service Level to Find Costs
When it comes to reducing costs, the first step is creating a culture of measurement, said Wittrock.
“One thing I think it’s helpful to start with is start out measuring where your big wins are, where do you want to focus? What’s going to move the needle a lot, what’s going to take a long time to do, what’s maybe not going to move it as much but is very easy to get done?” Wittrock said. “Then from there, figure out who the right folks to engage with are, what are the right teams, so you can start looking forward.”
It can also be hard to figure out whose problem it is to increase efficiencies and cut costs, added Subramanian. That’s why it should be a cross-team effort with a financial practice or center of excellence component to it, she said.
“Often we run into the situation where it’s everybody’s problem, but it’s nobody’s problem,” she said. “Having the central team is really important but it’s also important to understand that it now suddenly doesn’t become only the central teams responsibility for making sure the platform is efficient. It has to be a collaboration between engineering, finance, procurement — the team that is negotiating contracts with your cloud vendor or other vendors.”
Ekenstam asked the packed audience for a show of hands to determine who knows what their cloud bill is. He then asked for a show of hands from those who know how much their individual services or applications costs. Not surprisingly, the number was smaller although not insubstantial.
“That’s, to me, the first step you need to know — what you’re spending,” Ekenstam said. “That’s the big challenge, taking that cloud costs, that big bill, and actually breaking it into individual teams, individual applications, because only then when you have that visibility will you know where you have the opportunities to improve.”
Intuit runs a developer portal where it tracks all of its different software assets, whether it’s a service or application, he said. Each has an asset ID that is propagated and tagged to all the resources required to support that service or application, he said. Then, IT aggregates all the billing data attributed based on the service or application, and it provides a number for the development teams. Those numbers also are rolled up and provided to various directors and vice presidents.
“It’s not enough to give a top-level CTO or the CEO the bill — you need to get that visibility to people who can actually make decisions and make changes to how the system operates,” Ekenstam said.
“That level of visibility is really the first starting point when we started looking into things more closely at Shopify,” Subramanian added. “We were able to see clearly from the cloud bill what are the different projects, what are the different clusters, but it’s not exactly helpful, right? Because if you have a multitenant platform, you want to know how much is App A costing and how much is App B costing.”
Identifying application cost can enable the platform team to go to the team or leader and hold them responsible for making the changes necessary to improve the efficiency, she added.
Don’t Automatically Cut Where CPU Is Idle
It may seem like the best plan of action would be to cut wherever there are idle resources, but that’s actually not a great idea because it could interrupt a workload that’s trying to complete, warned Wittrock.
“The idle resources may be an artifact of what are the capabilities of the platform you’re running on? What does it offer and maybe that slack just needs to be there for your availability,” he said.
That’s why it’s important to view the efficiency and waste for each application across a variety of stakeholders.
“Shopify is an e-commerce platform and sometimes we have to reserve and scale up all the way because there’s a big flash sale coming up and that time, you don’t want to be scaled all the way down and you don’t want your Cluster Autoscaler to be kicking in and doing all of these things,” Subramanian said. “There are times when you want to protect your reputation, and it’s not about efficiency.”
That’s where a central finance team can come into play, ensuring that the platform returns to normal load after big peak events like Christmas for Shopify, she added.
“That’s why you need that central finance team because there’s somebody looking at this every day and reaching out to the appropriate teams to take action,” she said.
Operations: Focus on Business Need
Intuit has a number of different patterns to its workload. TurboTax is busiest during tax season, for instance, while QuickBooks is very busy during the traditional 9-to-5 work day, Ekenstam said.
“CPU, memory and compute resources is a big component of cost,” he said. “You need to really see how can you make your clusters and applications run most efficiently to minimize costs, but at the same time, provide the services that you need to.”
Shopify actually prepares for Black Friday and Cyber Monday by disabling autoscaling and scaling all the way up to projected traffic because then the goal is to protect Shopify’s reputation on those high volume days, said Subramanian.
“But at other times, we do leverage autoscaling,” she added. “We use VPA [Vertical Pod Autoscaler] to recommend what the right memory and CPU should be and we make that suggestion to the respective application team using a Slack channel.”
The application team knows the specific nature of their workload, so it’s up to them to review the recommendation and make the appropriate changes, she added.
Autoscaling is a key capability for reducing cloud costs, Ekenstam said, but it isn’t a panacea.
“If we can autoscale, not only your application, but also your cluster up and down, that’s for cost,” he said. “That’s obviously the best, but it does come with some disruption. So how can you minimize that disruption? I think a lot of it starts with making sure the apps can be disruptive.”
Design: Kill Kubernetes Pods to Best Utilize Resources
You can’t launch a pod in Kubernetes and expect that pod to live forever, Ekenstam said. At Intuit, they rotate their clusters every 25 days — a de-scheduler automatically de-schedules pods and reschedules them on another node to ensure it takes full advantage of the node resources, as well as so Intuit can update security patches and Amazon machine image (AMI) on the nodes, he explained.
“It also has a side effect of forcing all those applications to get rescheduled and trains our developers that, ‘Hey, I can’t count on these pods running forever. It’s okay that they terminate. It’s okay that they come back up,’” said Ekenstam. “By doing that, we’ve helped build this culture of understanding how Kubernetes works for the developers.”
Intuit is investigating developing a system that takes the recommendations from the vertical pod autoscaling, the historical metrics from each application, and then using that to make decisions and recommendations for both the VPA and the horizontal pod autoscaling (HPA). The system would integrate those recommendations and then apply them the pipeline using GitOps, he explained.
“If you change the resources of a pod, the change in resource will start back in your pre-pod environment, get tested, validate that it does work in pre-prod and work its way through the pipeline to your production environment,” Ekenstam said. “We don’t want to just suddenly change the resources in production without being able to test it first.”
Profiling Apps for Efficiency
It takes a partnership between the platform or infrastructure team and the applications team, said Subramanian.
“Something that Shopify has been working on recently is continuous profiling of applications because you don’t want to just tell application developers … make sure it’s efficient and optimal at all times,” said “In order to reduce the friction, we have rolled up this continuous profiling feature that every application is getting profiled continuously at a certain sample rate.”
That’s made it easy for developers to look at their app profile and make decisions about CPU usage, processes running, and so on, she added.
“Being able to create such tools and enable the application developers to make the right decision is also a key part of efficiency and optimization,” Subramanian noted.
At Intuit, whenever there is a new release of their platform, they run it through Failure Modes and Effects Analysis (FMEA) testing, which includes a load test, Ekenstam said.
“Then we measure how many nodes did it take to do that workload and that helps us identify some kind of performance regression, and performance regressions are also quite often cost regressions, because if you’re suddenly needing to use more nodes to do the same workload, it costs you more, so that’s another technique that we’ve used to identify and to compare different releases,” he said.
CNCF paid for travel and accommodations for The New Stack to attend the KubeCon+CloudNativeConEurope 2023 conference.