FinOps: It’s All about Culture and Automation
When Amazon’s CTO takes to the stage to ask CIOs to reduce the amount they spend on AWS services, it’s clear that we’ve reached an inflection point in the economics of the cloud.
But that’s precisely what Werner Vogels did at re:Invent 2023 when he told AWS customers they needed to become “frugal architects” and begin managing cloud costs.
And those costs are truly enormous, with Gartner predicting public cloud spending will hit $679 billion in 2024, up from $478 billion in 2022. The explosion in generative AI is fuel to this fire.
But while cloud spend is ballooning, it seems many tech leaders are not sure they have a handle on it. KPMG, the global accounting consultants, found that two-thirds of executives believed their cloud programs had not lowered the total cost of ownership of their IT systems.
However, KPMG also found that this was due, in large part, to a “near-universal failure” to grasp the differences between managing and consuming infrastructure in the cloud and managing the legacy, data center-based infrastructure it typically replaces.
Scaling Up, Scaling Down
In the on-premises world, development teams have “the budget defined before they even start working,” Pini Reznik, co-founder of the consulting firm Container Solutions, and co-author of “Cloud Native Transformation,” told The New Stack.
If dev teams that deploy on-prem want a new piece of hardware, they have to make a business case and work through often lengthy procurement and planning processes.
The cloud, by contrast, opens up “the entire world of resources” to engineers, Reznik said. Everything is beguilingly easy, including the provisioning of new servers and scaling up of infrastructure. And, “You don’t need to switch off anything … it can stay on and it saves you 10 minutes in the morning on setting up things.”
Surging costs might not become apparent until the quarterly bill comes in, by which point, says Reznik, “it’s too late to take it back.”
And when the financial costs of architectural decisions are not considered at the outset, projects or applications might be locked into higher levels of resources going forward.
This naturally creates a budgetary headache for the CFO. But the cloud also makes the CFO’s job of predicting future costs harder.
“In a versatile cloud environment, that’s very hard to predict,” Steven O’Dwyer, senior FinOps specialist at ProsperOps, told The New Stack. “And nobody has forecasting figured out because there are constantly new services coming out that are deployed and billed differently.”
Nevertheless, he said, the CFO will aim to establish a budget that CIOs and CTOs must live within. “And so that is pushing CFOs and CTOs to seek out waste so that they can actually have the budget for these high [return on investment] workloads.”
This has fueled the rise of professionals and teams devoted to FinOps, which O’Dwyer defines as the practice of “ensuring that you have both engineering optimization as well as rate optimization being tracked, monitored, addressed, implemented and refined.”
Creating a Cloud Center of Excellence
Unsurprisingly, those FinOps teams are left in what O’Dwyer described as a “Stretch Armstrong” position, looking to ensure engineers have the flexibility they need, while also meeting the CFO’s desire that cloud costs are manageable and predictable.
Ultimately FinOps involves a cultural shift, says O’Dwyer. “It’s not like you can buy a tool and say, ‘OK, I have FinOps.’” The CFO and CTO need to understand each other’s responsibilities and needs. But they also need to work together with their teams to help them understand the broader problem, and find concrete ways to address and manage it.
One practical step is to establish a “cloud center of excellence” that includes members of all the relevant teams and projects, and that meets regularly. This can then ensure that FinOps efforts — whether financially or engineering motivated — are aligned, and that compromises are agreed upon where necessary.
“Executive sponsorship” for this is essential, O’Dwyer said. Both the financial and engineering organizations must be committed to rooting out waste and taking action — while being cognizant of the need to scale up, and down, as needed. Both sets of leaders and teams have to buy into the concept, or it simply won’t fly.
When it comes to practical steps to turning objectives into reality, visibility into an organization’s cloud resources and usage is essential. And that requires tagging, O’Dwyer said.
The ability to segregate or allocate costs is paramount, he said: “All of the cloud providers allow you to tag resources. Now you can create from the ground up a structure and organizational structure of your cloud environment that makes it easy to track.”
Running a ‘Scream Test’
Once the cloud center of excellence has established the right tagging requirements, he said, they have to be enforced.
This can be through governance tools, whether open source, native or third-party. And of course, the FinOps team must ensure the tags are being used.
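To make this concrete, here is a minimal sketch of the kind of tag audit a governance tool or in-house script might run. The tag taxonomy (`team`, `environment`, `cost-center`) and the resource inventory are hypothetical, standing in for whatever a provider’s API would actually return.

```python
# Hypothetical sketch: audit cloud resources against the tag keys a
# cloud center of excellence might mandate. The resource dicts below
# stand in for real API responses from a cloud provider.

REQUIRED_TAGS = {"team", "environment", "cost-center"}  # example taxonomy

def missing_tags(resource):
    """Return the required tag keys a resource lacks, sorted for stable output."""
    return sorted(REQUIRED_TAGS - set(resource.get("tags", {})))

def audit(resources):
    """Map resource id -> missing tag keys, listing non-compliant resources only."""
    report = {}
    for r in resources:
        gaps = missing_tags(r)
        if gaps:
            report[r["id"]] = gaps
    return report

inventory = [
    {"id": "i-001", "tags": {"team": "payments", "environment": "prod",
                             "cost-center": "cc-42"}},
    {"id": "i-002", "tags": {"team": "search"}},  # a scream-test candidate
]
print(audit(inventory))  # {'i-002': ['cost-center', 'environment']}
```

A report like this is what gives the FinOps team its list of candidates for a scream test: anything still in the output after the grace period gets stopped.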
The FinOps team — and their supporters in the CFO’s and CTO’s offices — might think about conducting a “scream test” to encourage this. O’Dwyer recalls how, on one project he was involved in, with the CTO’s backing, engineers were given three months to tag all their cloud resources. The FinOps team monitored which resources were tagged.
The screaming part? “After three months, we literally started stopping resources, and we’d remove access, stop them, and then wait and see who screamed.”
If production or testing resources broke, it was clear these were required, and the FinOps team would tag them, and turn them back over to the relevant team.
Needless to say, executive sponsorship and support are essential for this kind of exercise, says O’Dwyer. “I think the biggest thing that CFOs and CTOs can do is to allow their team to do stuff like that.”
Once workloads and resources have been identified, costs can be allocated accordingly.
“What a lot of companies shift to is chargeback,” says O’Dwyer. This means individual departments or product teams will be responsible for their bill against a set budget. “And that puts the burden on them.”
Why a Chargeback Policy Is a Best Practice
Instituting a chargeback policy might sound like a last resort, but for O’Dwyer, “I think it’s best practice because it holds the engineers accountable.”
After all, the company as a whole has to be competitive against other companies, which will also be running in the cloud. “You have to be conscious of your margin,” he said.
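A chargeback report itself is straightforward once tagging is in place. The sketch below is a hypothetical illustration, rolling tagged billing line items up to the owning team and flagging anyone over a set budget; the item shapes and team names are invented.

```python
# Hypothetical sketch: roll tagged cost line items up per team and
# compare spend against each team's budget.
from collections import defaultdict

def chargeback(line_items, budgets):
    """Sum cost per 'team' tag and flag teams that exceed their budget."""
    spend = defaultdict(float)
    for item in line_items:
        spend[item["tags"]["team"]] += item["cost"]
    return {
        team: {"spend": round(total, 2),
               "budget": budgets[team],
               "over_budget": total > budgets[team]}
        for team, total in spend.items()
    }

items = [
    {"cost": 1200.0, "tags": {"team": "payments"}},
    {"cost": 800.0, "tags": {"team": "payments"}},
    {"cost": 450.0, "tags": {"team": "search"}},
]
print(chargeback(items, {"payments": 1500.0, "search": 500.0}))
# payments has spent 2000.0 against a 1500.0 budget, so it is flagged
```

This is the mechanism that “puts the burden on them”: the over-budget flag lands with the team that provisioned the resources, not with a central cloud bill.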
All this will get you part of the way toward managing cloud costs effectively. But if one part of FinOps is identifying waste and encouraging engineers to use their resources as efficiently as possible, the other — rate optimization — is ensuring they are not paying over the odds for even the reduced amount of capacity they are using.
“The challenge becomes making sure you have the flexibility to enable the engineers to make changes, while still not leaving the company overcommitted with potential wasted commitments,” he says.
This has always been a challenge for engineers. It’s no secret that committing to resources over a fixed period can result in substantial discounts. But simply analyzing previous usage is a major data exercise. Forecasting can veer into guesswork.
Meanwhile, the pricing and discount structures of cloud providers evolve as quickly — and become as complex — as their cloud offerings. Spot instances, savings plans, discount programs and reserved instances all offer opportunities to reduce costs and match the appropriate service to companies’ workloads.
This means FinOps disciplines must be backed up by automation, to maximize an organization’s “effective savings rate,” i.e., the proportion of discount secured on its cloud costs.
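The effective savings rate itself is a simple ratio: the share of the would-be on-demand bill that was actually saved. A minimal sketch, with made-up figures:

```python
def effective_savings_rate(actual_cost, on_demand_equivalent):
    """ESR: the fraction of the on-demand-equivalent bill that was saved."""
    return 1 - actual_cost / on_demand_equivalent

# e.g. paying $70,000 for usage that would cost $100,000 on demand:
print(f"{effective_savings_rate(70_000, 100_000):.0%}")  # 30%
```

The hard part is not the arithmetic but maximizing the numerator: continuously rebalancing commitments so that as much usage as possible is covered at the deepest available discount.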
“You absolutely have to have automation,” says Reznik. But this is best left to specialists, he says. “If you do it yourself, it’s very unlikely it will be good.”
ProsperOps’ rate optimization approach uses AI and automation to monitor customers’ environments, identify potential cost savings and match these to a cloud provider’s offers.
“With our automation, we work objectively,” O’Dwyer said. “Whereas a normal manually run FinOps team would have to have a great level of context of what their footprint would look like before they make a full-term, rigid commitment.”
Understanding future engineering changes requires a lot of communication, which delays action and leads to missed savings opportunities.
He cites an example of a customer aiming to deal with spiking demand around Black Friday and Cyber Monday. “This company had 40,000 resource changes, where our automation made 3,100 tweaks to their portfolio of discounts or commitment-based discounts to ensure that they had very high coverage all month long.”
Even the most dedicated FinOps professionals would struggle to keep on top of that, even setting aside their desire to spend time with their loved ones. “Our automation was able to cover them very aggressively for this spike in usage that was driven not by the engineers, but by the end customers.”
And that is perhaps the crux of FinOps and the cloud. The best results come when humans focus on what they’re good at, working together and agreeing on their objectives, then using the machines to achieve them. Of course, it also helps if the humans remember to turn the machines off at the right time as well.