Cloud Cost Management for DevOps
The use of cloud services has increased drastically over the last few years and is continuing to grow, with $397 billion in forecasted spend in 2022 (Gartner). As organizations spend more on cloud services, there is an increasing need for cloud resource management.
This responsibility often falls on DevOps or site reliability team members because they know which services the infrastructure needs and which are currently in use. Cloud resource management becomes a set of responsibilities layered on top of their existing DevOps work.
This team member typically acts as an intermediary between the engineering and finance departments, relaying cloud spend information back and forth. The first added responsibility is enabling cross-department visibility into costs.
As cloud usage and spend grow, cost optimization becomes a priority. At this point, DevOps is tasked with responding to costs by addressing cost anomalies and right-sizing contracts for cloud services based on actual usage. As cloud usage matures, many organizations adopt proactive cost optimization approaches. These include optimizing purchases by taking advantage of the discounts offered for one- to three-year commitments, such as Amazon Web Services’ Reserved Instances and Savings Plans. These commitments must be purchased, renewed, and exchanged in a timely fashion to avoid losing money.
Done properly, cloud cost management can save organizations significant amounts of money compared with on-demand prices. But there is a learning curve. Through trial and error, we have identified a set of near-certain challenges that DevOps team members can anticipate. More importantly, these hurdles can be avoided through proactive planning.
1. Accurately Attributing Costs to Teams and Projects
Before you can begin planning cost optimization, you need an accurate picture of your current infrastructure and spending. Though this seems like a simple task, cost attribution cannot be done by simply opening your AWS bill. The monthly bill is not granular enough to attribute costs to projects and teams. Furthermore, your bill includes expenses that are not tied to your cost of goods sold, such as R&D expenses associated with training machine learning models.
The best way to allocate costs accurately is by tagging infrastructure. With consistent tags, you won’t have to guess which instances belong to the business unit you are examining. But tagging only works if it is done accurately and with discipline; otherwise, your time will be used up retagging. Tagging requires a fair amount of manual work for each piece of infrastructure in each sub-account. In the worst case, tagging may require relaunching infrastructure, which could affect running applications. Because enforcing tag hygiene can be a challenge of its own, look into cloud management solutions that provide auto-tagging.
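As a rough illustration of tag-based allocation, the sketch below sums cost for each value of one tag key. Anything missing that tag goes into an explicit "(untagged)" bucket, so gaps in tag hygiene show up as a number instead of disappearing. The line-item format, the `team` tag key, and the helper names are all hypothetical. A real billing export, such as the AWS Cost and Usage Report, uses its own column names.

```python
from collections import defaultdict
from decimal import Decimal

UNTAGGED = "(untagged)"

def allocate_costs(line_items, tag_key):
    """Sum cost per value of tag_key; missing or empty tags go to UNTAGGED."""
    totals = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key) or UNTAGGED
        totals[owner] += Decimal(item["cost"])  # Decimal avoids float rounding on money
    return dict(totals)

def untagged_share(totals):
    """Fraction of total spend that could not be attributed."""
    total = sum(totals.values())
    return totals.get(UNTAGGED, Decimal(0)) / total if total else Decimal(0)

# Hypothetical, simplified line items
items = [
    {"resource": "i-0a1", "cost": "120.50", "tags": {"team": "search"}},
    {"resource": "i-0b2", "cost": "80.00", "tags": {"team": "checkout"}},
    {"resource": "vol-3c", "cost": "19.50", "tags": {}},
    {"resource": "i-0d4", "cost": "30.00", "tags": {"team": "search"}},
]
totals = allocate_costs(items, "team")
print(totals)
print(f"untagged share: {float(untagged_share(totals)):.1%}")
```

You can run the same function with a different key to separate cost of goods sold from expenses such as model training. For example, a hypothetical `cost-category` tag could take values like `cogs` and `r-and-d`.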
2. Planning With Finance Teams
Cloud usage and cost information is not easily accessible to all team members across departments. As a result, resource selection becomes a disjointed process. Finance team members set the budgets but rely on DevOps team members to understand costs and usage and to select instances that fit the budget. Then comes a round of back-and-forth to get approval from finance.
One way out of this time-consuming process is to create cross-department visibility. The manual approach is to track purchases and usage through spreadsheets, but this requires continuous maintenance and is prone to human error.
We highly recommend using tools designed to work across teams. At a minimum, they should align historical usage and spend information with customizable business units and present it in clear dashboards. Part of the challenge of purchasing infrastructure is forecasting spend across teams and products, especially when external factors could change service needs. By external factors, we mean scenarios such as 10% product growth or a 10% decrease in sales.
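Here is a minimal sketch of scenario-based forecasting. It assumes spend follows a straight-line trend and scales in proportion to each scenario's factor. Both assumptions are simplifications chosen for illustration, and the monthly figures and scenario names are hypothetical.

```python
from statistics import mean

def linear_trend(history):
    """Fit spend = intercept + slope * month by ordinary least squares.

    Needs at least two months of history.
    """
    months = range(len(history))
    m_bar, s_bar = mean(months), mean(history)
    slope = sum((m - m_bar) * (s - s_bar) for m, s in zip(months, history)) / sum(
        (m - m_bar) ** 2 for m in months
    )
    return s_bar - slope * m_bar, slope

def scenario_forecast(history, months_ahead, scenarios):
    """Project the trend forward, then scale it by each scenario's factor."""
    intercept, slope = linear_trend(history)
    start = len(history)
    baseline = [intercept + slope * (start + i) for i in range(months_ahead)]
    return {
        name: [round(value * factor, 2) for value in baseline]
        for name, factor in scenarios.items()
    }

history = [100.0, 110.0, 120.0, 130.0]  # hypothetical monthly spend, $k
scenarios = {"baseline": 1.0, "growth_10pct": 1.10, "sales_down_10pct": 0.90}
for name, values in scenario_forecast(history, 3, scenarios).items():
    print(name, values)
```

Keeping the scenarios as plain multipliers makes it easy for finance and DevOps to agree on them without touching the trend model itself.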
While baseline visibility is, in my opinion, a commodity in cloud management tools, forecasting and planning tools are equally important but harder to find. They are especially hard to find when you need them to integrate cleanly with a management solution that can act on the forecasts. Ideally, you want a tool that can integrate business knowledge, such as forecasts, budgets, and scenario analysis, into purchase recommendations.
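One conservative way to feed scenario analysis into a purchase decision is to commit only to the capacity that stays in use in every month of every scenario, and leave the rest on-demand. This is shown here only as a sketch. The heuristic, the instance counts, and the function names are assumptions for illustration, not any recommendation engine's actual logic. Committing to this floor gives up some coverage in exchange for a low risk of paying for idle commitments, as long as the forecasts hold.

```python
def commitment_floor(scenario_forecasts):
    """Units of capacity in use in every month of every scenario."""
    return min(min(months) for months in scenario_forecasts.values())

def coverage_by_scenario(scenario_forecasts, committed_units):
    """Share of forecast usage covered by the commitment in each scenario."""
    report = {}
    for name, months in scenario_forecasts.items():
        covered = sum(min(units, committed_units) for units in months)
        report[name] = covered / sum(months)
    return report

# Hypothetical forecast of average running instances per month
forecasts = {
    "baseline": [40, 42, 44, 46],
    "growth_10pct": [44, 46, 48, 51],
    "sales_down_10pct": [36, 38, 40, 41],
}
commit = commitment_floor(forecasts)
for name, share in coverage_by_scenario(forecasts, commit).items():
    print(f"{name}: {share:.1%} covered by {commit} committed units")
```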
3. Optimizing Purchases
Every instance you use can be purchased in several ways besides on-demand. You can also change a contract’s term length and the amount you pay upfront. This becomes an NP-hard problem: finding the right set of contracts for your infrastructure among thousands of options is too complex to solve by hand. Manually identifying the optimal set of commitments in spreadsheets is nearly impossible because there are simply too many combinations. As a result, many teams reach high coverage through the easier-to-use Savings Plans, but this also leaves a lot of money on the table.
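To give a sense of the scale, the sketch below counts purchase options using only the commitment choices AWS exposes:

- Standard or Convertible Reserved Instances, or Compute or EC2 Instance Savings Plans
- 1-year or 3-year terms
- No, Partial, or All Upfront payment

The count deliberately ignores quantities, regions, and mixing several options for one instance type. It also treats Savings Plans as if they were bought per instance type. The real search space is therefore far larger, and the per-type count is a simplification.

```python
from itertools import product

OFFERINGS = ["Standard RI", "Convertible RI", "Compute SP", "EC2 Instance SP"]
TERMS_YEARS = [1, 3]
PAYMENT_OPTIONS = ["No Upfront", "Partial Upfront", "All Upfront"]

def options_per_instance_type():
    """On-demand plus every (offering, term, payment) commitment combination."""
    commitments = list(product(OFFERINGS, TERMS_YEARS, PAYMENT_OPTIONS))
    return [("On-Demand", None, None)] + commitments

def portfolio_count(num_instance_types):
    """Ways to pick exactly one option for each instance type independently."""
    return len(options_per_instance_type()) ** num_instance_types

print(len(options_per_instance_type()))  # 25
for n in (1, 5, 20):
    print(n, f"{portfolio_count(n):.2e}")
```

With just 20 instance types and one option each, there are 25^20 (roughly 9 × 10^27) possible portfolios, far beyond what a spreadsheet comparison can cover.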
There are tools that help with purchase planning, but make sure their recommendations go beyond what AWS Cost Explorer provides. Recommendations should not be treated as one-size-fits-all. Many recommendation engines only give you recommendations for a single contract type, term length, or upfront payment option. Instead, look for a tool that considers every possible combination; otherwise, you will miss opportunities for savings or flexibility.
In this case, flexibility refers to a few different things. For example, Compute Savings Plans can be applied across instance families and regions, so they are much more flexible in what they can cover. EC2 Reserved Instances, by contrast, apply only to a single instance family in a single region. Flexibility also means considering different term lengths. Resources with variable usage may not be well suited to longer three-year commitments, while resources with more predictable usage may be.
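A small model makes the term-length trade-off concrete. A commitment is billed for every hour of its term whether or not the capacity is used. For a single unit, it therefore only beats on-demand when utilization exceeds 1 - discount. The sketch below compares a 3-year and a 1-year commitment for a steady workload and a declining one. The $1.00/hour rate and the 30% and 50% discounts are made-up illustrative numbers, not AWS prices. The model also ignores upfront payment timing, exchanges, and sharing a commitment across instances.

```python
HOURS_PER_YEAR = 8760

def net_savings(rate, discount, utilization_by_year, committed_years):
    """Savings versus pure on-demand for one unit of capacity.

    rate: on-demand price per hour (hypothetical).
    discount: commitment discount as a fraction, e.g. 0.30.
    utilization_by_year: fraction of hours the unit is actually needed each year.
    committed_years: leading years covered by the commitment; later years
    run on-demand, so they contribute no savings.
    """
    savings = 0.0
    for year, utilization in enumerate(utilization_by_year):
        if year >= committed_years:
            break
        on_demand = rate * HOURS_PER_YEAR * utilization
        committed = rate * HOURS_PER_YEAR * (1 - discount)  # paid even when idle
        savings += on_demand - committed
    return savings

steady = [1.0, 1.0, 1.0]
declining = [1.0, 0.5, 0.2]

print("steady, 3-year:", round(net_savings(1.0, 0.50, steady, 3)))
print("steady, 1-year renewed:", round(net_savings(1.0, 0.30, steady, 3)))
print("declining, 3-year:", round(net_savings(1.0, 0.50, declining, 3)))
print("declining, 1-year only:", round(net_savings(1.0, 0.30, declining, 1)))
```

With these assumed numbers, the 3-year term wins for the steady workload (13,140 vs. 7,884). A single 1-year term wins for the declining one (2,628 vs. 1,752), because the 3-year commitment sits partly idle in its later years.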
4. Continuous Management and Optimization
Manually executing purchases, exchanges, and renewals takes continuous effort. Contracts are purchased at different times and cover different periods, which makes it difficult to remember when each one needs renewing. Set up expiration reminders as soon as you purchase contracts. Better yet, set up a governance system that makes purchases and exchanges on its own.
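Here is a minimal sketch of an expiration reminder, assuming you record each commitment's start date and term when you buy it. The `Commitment` record, its identifiers, and the 30-day window are hypothetical, and the end date is approximated to the day. In practice, you might instead read end dates from your provider, for example the `End` field returned by EC2's DescribeReservedInstances API. You could then feed the result into whatever alerting you already use.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Commitment:
    commitment_id: str  # hypothetical identifier
    kind: str
    start: date
    term_years: int

    @property
    def end(self) -> date:
        """Same calendar day, term_years later (Feb 29 falls back to Feb 28)."""
        try:
            return self.start.replace(year=self.start.year + self.term_years)
        except ValueError:
            return self.start.replace(year=self.start.year + self.term_years, day=28)

def expiring_within(commitments, today, days):
    """Commitments ending between today and today + days, soonest first."""
    horizon = today + timedelta(days=days)
    due = [c for c in commitments if today <= c.end <= horizon]
    return sorted(due, key=lambda c: c.end)

commitments = [
    Commitment("ri-example-1", "Standard RI", date(2023, 3, 15), 1),
    Commitment("sp-example-2", "Compute SP", date(2021, 4, 10), 3),
    Commitment("ri-example-3", "Convertible RI", date(2022, 6, 1), 3),
]
for c in expiring_within(commitments, today=date(2024, 3, 1), days=30):
    print(f"{c.commitment_id} ({c.kind}) expires {c.end.isoformat()}")
```

Running this check daily, with a window long enough to evaluate renewal options, turns "remember to renew" into a routine alert.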
Beyond remembering when to renew, continuously monitoring usage and cost data is important for identifying and addressing cost anomalies immediately. With a manual approach, cost anomalies are only discovered when the bill arrives at the end of the month, which can mean weeks of avoidable wasted spend. Use real-time cost anomaly detection tools that alert the appropriate team members and identify the root cause of the issue.
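The sketch below illustrates the idea, not how any particular product works. It flags a day whose spend sits more than three standard deviations above the trailing seven-day average. The window, the threshold, and the sample data are assumptions, and managed services such as AWS Cost Anomaly Detection use their own models. One simple way to point alerts at the likely root cause is to run the check per service or per tag rather than on the account total.

```python
from statistics import mean, stdev

def find_anomalies(daily_spend, window=7, threshold=3.0):
    """Return (day_index, spend, z_score) for days that spike above the trailing window."""
    anomalies = []
    for day in range(window, len(daily_spend)):
        past = daily_spend[day - window:day]
        mu, sigma = mean(past), stdev(past)
        spend = daily_spend[day]
        if sigma == 0:
            # Flat history: treat any increase as an unbounded deviation
            z = float("inf") if spend > mu else 0.0
        else:
            z = (spend - mu) / sigma
        if z > threshold:  # only upward spikes are flagged
            anomalies.append((day, spend, z))
    return anomalies

# Hypothetical daily spend in dollars; day 8 has a runaway resource
daily = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 101.0, 250.0, 102.0]
for day, spend, z in find_anomalies(daily):
    print(f"day {day}: ${spend:.2f} (z = {z:.1f})")
```

Note that a flagged day stays in later trailing windows and inflates their standard deviation. A production detector would typically exclude confirmed anomalies from its baseline.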
Continuous optimization also requires incorporating new AWS offerings and prices into your models as soon as they are released (which happens quite often). While you can try to do this manually, a tool that makes recommendations using the latest pricing data will save you a lot of effort.