5 Monitoring Characteristics SREs Must Embrace

9 Jun 2021 6:29am, by

Theo Schlossnagle
Theo founded Circonus in 2010, and continues to be its principal architect. He has been architecting, coding, building and operating scalable systems for 20 years. As a serial entrepreneur, he has founded four companies and helped grow countless engineering organizations. Theo is the author of Scalable Internet Architectures (Sams), a contributor to Web Operations (O’Reilly) and Seeking SRE (O’Reilly), and a frequent speaker at worldwide IT conferences. He is a member of the IEEE and a Distinguished Member of the ACM.

In today’s world of service-centric, “always on” IT environments, more organizations are implementing site reliability engineer (SRE) functions that are responsible for defining ways to measure availability and uptime, accelerate releases and reduce the costs of failures.

SREs operate in continuous integration/continuous delivery (CI/CD) environments where user demand drives frequent, high-performing release cycles and systems change quickly. It’s so dynamic that traditional monitoring approaches are trying to solve problems that no longer exist and simply do not meet new monitoring expectations and requirements.

SREs need a new, updated way to approach monitoring.

Today’s systems are born in an agile world and remain fluid to accommodate changes in both the supplier and the consumer landscape. This highly dynamic system stands to challenge traditional monitoring paradigms.

At its heart, monitoring is about observing and determining the behavior of systems. Its purpose is to answer the ever-present question: Are my systems doing what they are supposed to? In the old world of slow release cycles, often between six and 18 months, the system deployed at the beginning of a release looked a lot like the same system several months later. Simply put, it was not very fluid, which is great for monitoring. If the system today is the system tomorrow and the exercise that system does today is largely the same tomorrow, then the baselines developed by observing the behavior of the system’s components will likely live long, useful lives.

In the new world of rapid and fluid business and development processes, change occurs continually. The problem here is that the fundamental principles that power monitoring, the very methods that judge whether your machine is behaving itself, require an understanding of what good behavior looks like. To understand whether systems are misbehaving, you need to know what it looks like when they are behaving.

In this new world, many organizations are adopting a microservices architecture pattern. Microservices dictate that the solution to a specific technical problem should be isolated to a network-accessible service with clearly defined interfaces, such that the service has freedom. This freedom is very powerful, but the true value lies in decoupling release schedules and maintenance, and allowing for independent higher-level decisions around security, resiliency and compliance. The combination of these two changes results in something quite unexpected for the world of monitoring: The system of today neither looks like nor should behave like the system of tomorrow.

Characteristics of Successful Monitoring

For SREs to be successful, they need a new, modern way to manage and monitor rapidly scaling and rapidly changing IT infrastructure, where monitoring is a key component of service delivery. So what should monitoring in the world of SREs look like? Organizations and SREs who are successfully adjusting their monitoring strategies have the following characteristics in common:

1. Measure Performance to Meet Quality-of-Service Requirements

It is time to move beyond only pinging a system to see if it is up or down. Pinging is useful, but it is not the same as knowing how well the service is running and whether it is meeting business requirements. Knowing, in real time, that a machine is running and delivering its portion of a customer-facing service is real business value.

The next question becomes how to most efficiently measure performance for those quality-of-service requirements. The answer is to measure the latency of every interaction between every component in the system. In this service-centric world, high latency is the new “down.” Instead of just checking for available disk space or number of IO operations against that disk, it’s important to check, for example, the latency distribution of the API requests. Just knowing how much memory the system is using isn’t enough. It’s much more important to know how many microseconds of latency occur against every query.
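To make "latency distribution" concrete, here is a minimal Python sketch that counts every request latency into a bucket instead of keeping a single number. The bucket boundaries and the function names are assumptions chosen for illustration; a real monitoring system would pick its own boundaries and storage format.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in microseconds (illustrative only).
BUCKET_BOUNDS_US = [100, 500, 1_000, 5_000, 10_000, 50_000]


def bucket_label(latency_us: float) -> str:
    """Return the label of the first bucket whose upper bound covers the sample."""
    i = bisect_left(BUCKET_BOUNDS_US, latency_us)
    if i == len(BUCKET_BOUNDS_US):
        # Slower than the largest bound: goes into the overflow bucket.
        return f">{BUCKET_BOUNDS_US[-1]}us"
    return f"<={BUCKET_BOUNDS_US[i]}us"


def latency_histogram(samples_us):
    """Count every sample (no sampling) into its latency bucket."""
    return Counter(bucket_label(s) for s in samples_us)


# Made-up latencies for seven API requests, in microseconds.
samples = [80, 120, 450, 900, 950, 4_000, 60_000]
hist = latency_histogram(samples)

for label, count in hist.items():
    print(f"{label:>10}: {count}")
```

A histogram like this keeps the shape of the distribution, so the one request that took 60,000 microseconds stays visible instead of being folded into an average.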

What should be measured is the actual performance of all of the components in the system and along the path of service delivery between the customer and the data center. Don’t just check to see if an arbitrarily determined “average” transaction time has been met or that a system is up. While these kinds of traditional metrics are still useful and necessary to monitor, it is crucial to see if your quality-of-service requirements are met.

Every user on a web app or every customer website hit uses a plethora of infrastructure components, and the quality of the user’s experience is affected by the performance of numerous microservices. Completely understanding performance requires checking the latency of every component and microservice in that system. All of those latencies add up to make or break the customer experience, thereby determining the quality of your service.

Will the quality of your service be affected if one out of every 100 database queries is painfully slow? Will your business be impacted if five out of every 100 customer experiences with your service are unpleasant? The traditional method of storing and alerting on averages leaves SREs blind to these situations. Every user matters, and so does every user interaction. Their experience is directly affected by every component interaction, every disk interaction, every cloud service interaction, every microservice interaction and every API query, so they should all be measured.
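The blind spot created by averages can be shown with a small Python sketch. The latency values are invented for illustration, and the `percentile` helper uses the simple nearest-rank definition, which is one of several common ways to compute a percentile.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 95 pleasant experiences at 10 ms, 5 unpleasant ones at 2000 ms.
latencies_ms = [10] * 95 + [2000] * 5

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Share of interactions slower than a hypothetical 500 ms tolerance.
slow_fraction = sum(1 for l in latencies_ms if l > 500) / len(latencies_ms)

print(f"mean={mean} ms, p50={p50} ms, p99={p99} ms, slow={slow_fraction:.0%}")
```

In this data set the mean is 109.5 ms, a value that no user actually experienced. The median shows that the typical request takes 10 ms, and the 99th percentile exposes the 2000 ms experiences that the average blurs away.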

Imagine measuring the total latency experienced by the user and alerting SREs to unacceptable latency in subcomponents and microservices, before they affect end-to-end service quality. If it is not measured, then SREs are blind to the underpinnings of what causes a web app or website to meet, or to fail, service-level agreements.
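One way to picture this is a per-component latency budget. The following Python sketch assumes a single request that passes through four components in series; the component names, budgets and end-to-end target are all hypothetical.

```python
# Hypothetical latency budgets per component, in milliseconds.
BUDGETS_MS = {
    "load_balancer": 5,
    "auth_service": 20,
    "catalog_api": 50,
    "database": 40,
}

# Hypothetical end-to-end target for the whole request, in milliseconds.
END_TO_END_TARGET_MS = 150


def over_budget(observed_ms, budgets_ms):
    """Return the names of components whose observed latency exceeds their budget.

    Components without a configured budget are never flagged.
    """
    return sorted(
        name
        for name, latency in observed_ms.items()
        if latency > budgets_ms.get(name, float("inf"))
    )


# Invented measurements for one request, assuming the calls happen in series.
observed = {"load_balancer": 3, "auth_service": 18, "catalog_api": 45, "database": 70}

total_ms = sum(observed.values())
warnings = over_budget(observed, BUDGETS_MS)

print(f"end-to-end: {total_ms} ms (target {END_TO_END_TARGET_MS} ms)")
print(f"components over budget: {warnings}")
```

Here the request still finishes inside the end-to-end target, yet the database has already blown through its own budget. That is the kind of early signal that is invisible if only the total is measured.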

SREs require the ability to reliably and cost-effectively measure all data, not just samples, from everything. You need full observability of all your infrastructure and all your metrics. Not only will this help accelerate problem resolution, but once you have these metrics collected, it’s possible for your teams to surface additional business value within this sea of data.

2. Democratize Data To Improve Productivity and Save Time

One of the hallmarks of conventional monitoring is having disparate monitoring tools that each have a specific purpose and create silos of metric data. It’s a patchwork environment with a lack of consistent standards and processes; as a result, there’s no ability to share information in a clear and cohesive way among different teams within the organization.

Having disparate tools often costs more and consumes more resources, and often only a few people know how to use them. This not only creates the potential for serious disruptions if those people leave the organization, but it also prevents teams within the IT organization from being able to find answers on their own. For example, an engineer responsible for application performance monitoring cannot get needed information on network health without relying on someone from the network team to get it for them, making tasks like troubleshooting take longer. At the strategic level, there is no way to get a comprehensive and consolidated view of the health and performance of the systems that underpin the business.

By centralizing all of your metrics — application, infrastructure, cloud, network, container — into one observability platform, your organization gains a consistent metrics framework across teams and services. You democratize your data so anybody can immediately access that data any time and use it in a way that is correlated to the other parts of your business, eliminating the time-consuming barriers associated with legacy monitoring tools. A centralized platform that consistently presents and correlates all data in real time consolidates monitoring efforts across all teams within the organization and enables the business to extract the maximum value from its monitoring efforts.

3. Gain Deeper Context to Reduce MTTR and Gain Higher Insights

Today’s SREs are swimming in data that is constantly spewing from every infrastructure component — virtual, physical, or cloud. Identifying the source of a performance issue from what can be millions of data streams can require hours and hours of engineering time using traditional monitoring processes. To quickly troubleshoot performance issues, SREs need more context.

Metrics with context allow SREs to correlate events, so they can reduce the amount of time required to identify and correct the root cause of service-impacting faults. This is why it’s imperative SREs have monitoring solutions that are Metrics 2.0 compliant. Metrics 2.0 is a set of conventions, standards and concepts around time-series metrics metadata with the goal of generating metrics in a format that is self-describing and standardized.

The fundamental premise of Metrics 2.0 is that metrics without context don’t have much value. Metrics 2.0 requires metrics be tagged with associated metadata or context about the metric being collected. For example, collecting CPU utilization from a hundred servers without context is not particularly useful. But with Metrics 2.0 tags, you will know that this particular CPU metric is from this particular server, within this particular rack, at this specific data center, doing this particular type of work. It’s much more useful.

When all metrics are tagged in this manner, queries and analytics become quite powerful. You can search based on these tags, and you are able to slice and dice the data in many ways to glean insights and intelligence about your operations and performance.
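As a rough illustration of searching and slicing by tags, the Python sketch below attaches a dictionary of tags to each data point. The tag keys used here (`server`, `rack`, `datacenter`, `role`) and the helper functions are assumptions made for this example; they are not taken from the Metrics 2.0 specification or from any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    value: float
    tags: dict  # context describing where the value came from


def select(metrics, **wanted):
    """Return the metrics whose tags match every requested key/value pair."""
    return [
        m for m in metrics
        if all(m.tags.get(key) == value for key, value in wanted.items())
    ]


def mean_by(metrics, tag):
    """Group metric values by one tag and average each group."""
    groups = defaultdict(list)
    for m in metrics:
        groups[m.tags.get(tag, "untagged")].append(m.value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Invented CPU utilization readings, in percent.
metrics = [
    Metric("cpu_utilization", 30.0, {"server": "web01", "rack": "r1", "datacenter": "east", "role": "web"}),
    Metric("cpu_utilization", 50.0, {"server": "web02", "rack": "r1", "datacenter": "east", "role": "web"}),
    Metric("cpu_utilization", 90.0, {"server": "db01", "rack": "r2", "datacenter": "east", "role": "db"}),
    Metric("cpu_utilization", 40.0, {"server": "web03", "rack": "r7", "datacenter": "west", "role": "web"}),
]

east_web = select(metrics, datacenter="east", role="web")
by_role = mean_by(metrics, "role")

print([m.tags["server"] for m in east_web])
print(by_role)
```

The same four numbers without tags would say almost nothing. With tags, one query isolates the web servers in one data center and another shows that the database role is running far hotter than the web role.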

4. Articulate What Success Looks Like

Using a language to articulate what success looks like allows people to win. It is disheartening to think you’ve done a good job and met expectations, and then learn the goalposts have moved or that you cannot articulate why you’ve been successful. The art of the service-level objective (SLO) reigns here. SLOs are an agreement on an acceptable level of availability and performance. Understanding the service your business provides and the levels at which you aim to deliver that service is the heart of monitoring.

Because an SLO is an availability and a performance guarantee, it should not be set around identifying when things are broken. Rather, SLOs should be set around customer perceived value, because this is what directly affects your ability to be successful.

A lot of organizations spend significant effort trying to set their SLOs correctly. Unfortunately, this is wasted effort, because you’re going to be wrong. The approach should not be to get your SLOs perfect the first time, which is impossible. Rather, SLOs should be an iterative process. You should have a feedback loop that informs you about whether you need to change your concept of what deserves an SLO and what the parameters should be, based on information you learn every day. The key is to have flexibility with your SLOs. You will need to reassess them regularly to ensure they’re not too loose and not too tight.
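The feedback loop can start very simply: evaluate the same window of real measurements against several candidate SLOs and see which ones are too tight or too loose. The Python sketch below assumes a latency-based SLO of the form "a target fraction of requests complete within a threshold"; the class name, thresholds, targets and latency data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    threshold_ms: float  # a request is "good" if it completes within this time
    target: float        # required fraction of good requests, e.g. 0.99


def attainment(slo, latencies_ms):
    """Fraction of requests in the window that met the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for latency in latencies_ms if latency <= slo.threshold_ms)
    return good / len(latencies_ms)


def is_met(slo, latencies_ms):
    """True if the window's attainment reaches the SLO target."""
    return attainment(slo, latencies_ms) >= slo.target


# Invented window of 1,000 requests.
window = [100] * 980 + [400] * 15 + [1500] * 5

# Candidate SLOs to compare during a periodic review.
candidates = [
    LatencySLO(threshold_ms=300, target=0.99),
    LatencySLO(threshold_ms=500, target=0.99),
    LatencySLO(threshold_ms=2000, target=0.99),
]

report = [
    (slo.threshold_ms, attainment(slo, window), is_met(slo, window))
    for slo in candidates
]

for threshold, achieved, met in report:
    print(f"<= {threshold} ms for 99% of requests: achieved {achieved:.1%}, met={met}")
```

With this data the 300 ms candidate is missed, the 500 ms candidate is met with a little room to spare, and the 2000 ms candidate is met by every single request, which suggests it is too loose to say anything useful. Rerunning a comparison like this on fresh data is one way to keep the reassessment regular.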

5. Retain Your Data So You Can Reduce Future Risk

Monitoring data has often been considered low value and high cost. Times have changed and, as with all things computing, the cost of storing data has fallen dramatically. More importantly, SREs have changed the value of long-term retention of this data. SREs operate in a culture of learning. When things go wrong, and they always do, it is critical to have a robust process for interrogating the system and the organization to understand how the failure transpired. This allows processes to be altered to reduce future risk. At the pace we move, your organization will inevitably develop intelligent questions about a failure that nobody thought to ask in the immediate aftermath of past failures. Those new questions are crucial to the development of your organization, but they become absolutely precious if you can travel back in time and ask those questions about past incidents. Data retention can often lead to valuable learning that reduces future risk.

SREs Require More Advanced Monitoring

The reality of today’s “always on” service-centric IT environments means that monitoring plays a different and more impactful role than it has in the past. As such, SREs have new, more advanced requirements and expectations when it comes to monitoring. As you embrace these monitoring characteristics, you’ll immediately begin to elevate the relevance of monitoring to your business’s success. You’ll gain lots of other benefits as well, like faster problem identification and resolution, full visibility into all your metrics, better performance, reduced costs and more confidence in the accuracy of your decisions.
