No More FOMO: Efficiency in SLO-Driven Monitoring
Efficiency means using the available resources in the best possible way to achieve the desired outcomes. In the current scenario, where every business is facing fierce competition and changing customer demands, efficiency is crucial for survival and growth. Resources include not only money, but also time, productivity, quality and strategy.
IT spending is often a reflection of the market conditions. When the market is booming, companies tend to spend more on IT projects and tools, without being too concerned about the value they are getting from them. This can create some problems, such as having too many tools that are not integrated or aligned with the business goals, wasting resources on unnecessary or redundant tasks, and losing visibility and control over the IT environment.
Even the companies that spend heavily on cloud services are reconsidering their big decisions that involve significant, long-term investments. Companies are reassessing their existing substantial spend to ensure their investments can be aligned with revenues or future revenue potential.
Observability tools are also subject to the same review. It is essential that the total operating cost of observability tools can also be directly linked to revenue, customer satisfaction, growth in business innovation and operational efficiency.
Why Do We Need Monitoring?
- If we had a system that would absolutely never fail, we wouldn’t need to monitor that system.
- If we had a system for which we never have to worry about being performant, reliable or functional, we wouldn’t need to monitor that system.
- If we had a system that self-corrects itself and auto-recovers from failures, we wouldn’t need to monitor that system.
None of the aforementioned points are true today, and it is obvious that we need to set up monitoring for our infrastructure and applications no matter what scale you operate.
What Is FOMO-Driven Monitoring?
When you are responsible for operating a critical production system, it is natural to want to collect as much monitoring data as possible. After all, the more data you have, the better equipped you will be to identify and troubleshoot problems. However, there are a number of challenges associated with collecting too much monitoring data.
One of the biggest challenges of collecting too much monitoring data is data overload. When you have too much data, it can be difficult to know what to look at and how to prioritize your time. This can lead to missed problems and delayed troubleshooting.
Another challenge of collecting too much monitoring data is storage costs. Monitoring data can be very large, and storing it can be expensive. If you are not careful, you can quickly rack up a large bill for storage.
When there is too much data, it can be difficult to see the big picture. This can make it difficult to identify trends and patterns that could indicate potential problems.
More data also means more noise. This can make it difficult to identify important events and trends.
Collecting too much monitoring data can also raise security concerns. If your monitoring data is not properly secured, it could be vulnerable to attack. This could lead to theft of sensitive data or disruption of your production systems.
Ultimately, an approach driven by the fear of missing out does not result in an optimal observability situation/setup and, in fact, can contribute to plenty of chaos, increased expenses, ambiguity between teams and overall increase in poor efficiency.
You can address this situation by being intentional in making decisions on all aspects of the observability pipeline including signal collection, dashboarding and alerting. Using service-level objectives SLOs is one of the strategies that offers plenty of benefits.
What Are SLOs?
An SLO is a target or goal for a specific service or system. A good SLO will define the level of performance your application needs, but not any higher than necessary.
SLOs help us set a target performance level for a system and measure the performance over a period of time.
Example SLO: An API’s p95 will not exceed 300ms response time
How Do You Set SLOs?
SLOs are generally set by customers. Yes, they are the ultimate authority. However, customers do not actually set SLOs as you can imagine. It is up to the business teams to tell the IT operations and development teams the expected performance and availability of a system.
For example, the business teams operating a marketing lead sign-up page can tell the IT teams that they want the page to load within 200ms at least 90% of the time. They would derive this conclusion by looking at the customer behavior already captured.
Now the IT teams can set the SLO for tracking by identifying SLIs(service-level indicators) in order to measure the SLOs over a period of time. SLIs are the specific metrics and query details of the metrics used to keep track of the SLO progression.
Here is what your observability life cycle looks like implementing an SLO-driven strategy.
There is an intentional loopback mechanism that is set in taking the SLO-driven strategy. Observability is never a settled problem. Organizations that do not continue reinventing their observability strategy fall behind very quickly, resulting in ambiguous tools, outdated processes and practices, which in the end increases overall operational cost while decreasing efficiency.
With this approach, you get the ability to scientifically measure your infrastructure and application performance over a period of time. Data collected as a result can be used to influence important decisions made on infrastructure spend which in turn helps improve further efficiency.
What Does This Tell Us?
Taking an SLO-first approach allows us to be intentional about the metrics to collect to meet commitments to business.
These are some of the benefits that organizations can achieve by following SLO-based observability strategy:
- Results in improved signal vs. noise ratio
- Reduces tool proliferation
- Enriched monitoring data resulting in reduced MTTR/MTTI
- Feedback loop provides continuous improvement opportunities
- Connect monitoring costs in relation to business outcomes, hence able to justify spend to management
Use SLOs to drive your monitoring decisions:
- Measure, revisit and review SLOs periodically based on outcomes
- Improve observability posture through
- Lower cost
- Reduced issue resolution time
- Increased team efficiency and innovation
We live in an era where efficiency is critical for organizational success. Observability costs can become uncontrollable if you do not have a proper strategy in place. SLO-driven observability strategy can help you set guardrails, track performance goals, business metrics and measure impact in a consistent manner while increasing operational efficiency and innovation.