4 Key Observability Best Practices
With bigger systems, higher loads and more interconnectivity between microservices, cloud native environments have become far more complex. They emit somewhere between 10 and 100 times more observability data than traditional, VM-based environments.
As a result, engineers spend more of their workdays on investigations, cobbling together a story of what happened from siloed telemetry, leaving less time to innovate.
Without the right observability setup, precious engineering time is wasted sifting through data to find where a problem lies rather than shipping new features, which can mean buggier releases and a worse customer experience.
So, how can modern organizations find relevant insights in a sea of telemetry and make their telemetry data work for them, not the other way around? Let’s explore why observability is key to understanding your cloud native systems, and four observability best practices for your team.
What Are the Benefits of Observability?
Before we dive into ways your organization can improve observability, lower costs and ensure smoother customer experience, let’s talk about what the benefits of investing in observability actually are.
Better Customer Experience
With better understanding and visibility into relevant data, your organization’s support teams can gain customer-specific insights to understand the impact of issues on particular customer segments. Maybe a recent upgrade works for all of your customers except for those under the largest load, or during a certain time window. Using this information, on-call engineers can resolve incidents quickly and provide more detailed incident reports.
Better Engineering Experience and Retention
By investing in observability, site reliability engineers (SREs) benefit from knowing the health of teams or components of the systems to better prioritize their reliability efforts and initiatives.
As for developers, benefits of observability include more effective collaboration across team boundaries, faster onboarding to new services/inherited services and better napkin math for upcoming changes.
Four Observability Best Practices
Now that we have a better understanding of why teams need observability to run their cloud native system effectively, let’s dive into four observability best practices teams can use to set themselves up for success.
1. Integrating with Developer Experience
Observability is everyone’s job, and the best people to instrument it are the ones who are writing the code. Maintaining instrumentation and monitors should not be a job just for the SREs or leads on your team.
A thorough understanding of the telemetry life cycle (the life of a span, metric or log) is key: from initial configuration, to emitting signals, to any modification or processing done before the data is stored. A high-level architecture diagram helps engineers see if and where their instrumentation gets modified (such as aggregation or dropping). This processing often falls in the SRE domain and is invisible to developers, who then won't understand why their new telemetry is partially or entirely missing.
You can check out simple instrumentation examples in this OpenTelemetry Python Cookbook.
If there are enough resources and a clear need for a central internal tool, platform engineering teams should consider writing thin wrappers around instrumentation libraries to ensure standard metadata is available out of the box.
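As a sketch of what such a thin wrapper might look like, the example below uses Python's standard `logging` module to stamp org-wide metadata onto every log record. The field names (`service`, `team`, `version`) and values are illustrative assumptions, not a fixed schema; a real wrapper would likely sit around your tracing and metrics libraries as well.

```python
import logging

# Hypothetical standard metadata every service should emit; the field
# names here (service, team, version) are illustrative, not a schema.
STANDARD_METADATA = {
    "service": "payments-api",
    "team": "payments",
    "version": "1.4.2",
}

class StandardMetadataFilter(logging.Filter):
    """Stamp every log record with org-wide standard fields."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in STANDARD_METADATA.items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it

def get_logger(name: str) -> logging.Logger:
    """Thin wrapper teams call instead of logging.getLogger directly."""
    logger = logging.getLogger(name)
    logger.addFilter(StandardMetadataFilter())
    return logger
```

With a wrapper like this, developers get consistent metadata out of the box without having to remember to attach it themselves.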
Viewing Changes to Instrumentation
Another way to enable developers is to provide a quick feedback loop for local instrumentation, so they can view changes before merging a pull request. This is especially helpful for training and for teammates who are new to instrumenting or unsure where to start.
Updating the On-Call Process
Updating the on-call onboarding process to pair a new engineer with a tenured one for production investigations can help distribute tribal knowledge and orient the newbie to your observability stack. It’s not just the new engineers who benefit. Seeing the system through new eyes can challenge seasoned engineers’ mental models and assumptions. Exploring production observability data together is a richly rewarding practice you might want to keep after the onboarding period.
You can check out more in this talk from SRECon, “Cognitive Apprenticeship in Practice with Alert Triage Hour of Power.”
2. Monitor Observability Platform Usage in More Than One Way
For cost reasons, get comfortable tracking your current telemetry footprint and reviewing tuning options (such as dropping, aggregating or filtering data) so your organization can monitor costs and platform adoption proactively. Tracking telemetry volume by type (metrics, logs, traces or events) and by team helps define and delegate cost-efficiency initiatives.
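The volume-by-type-and-team breakdown can be sketched as a simple aggregation. The record shape below (`team`, `type`, `bytes` keys) is an assumption for illustration; in practice this data would come from your observability platform's usage API or billing export.

```python
from collections import defaultdict

def volume_by_team_and_type(records):
    """Aggregate telemetry volume (in bytes) by (team, signal_type).

    `records` is assumed to be an iterable of dicts with hypothetical
    keys: 'team', 'type' (metric/log/trace/event) and 'bytes'.
    """
    totals = defaultdict(int)
    for r in records:
        totals[(r["team"], r["type"])] += r["bytes"]
    return dict(totals)
```

A report like this makes it straightforward to hand each team its own slice of the footprint when delegating cost-efficiency work.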
Once you’ve gotten a handle on how much telemetry you’re emitting and what it’s costing you, consider tracking the daily and monthly active users. This can help you pinpoint which engineers need training on the platform.
These observability best practices for training and cost will lead to a better understanding of the value each vendor provides, as well as what's underutilized.
3. Center Business Context in Observability Data
Surfacing the business context buried in a pile of observability data pays off in high-stakes moments in several ways:
- By making it easier to translate incidents into their impact on user-facing workflows and functionality.
- By creating a more efficient onboarding process for engineers.
One way to center business context in observability data is by renaming default dashboards, charts and monitors.
4. Un-Silo Your Telemetry
Teams need better investigations. Remediation goes more smoothly when engineers can follow a trail of breadcrumbs through an organized process, rather than juggling 10 different bookmarked links and a mental map of what data lives where.
One way to do this is to map what telemetry your system emits across metrics, logs and traces, and pinpoint duplication or better sources of data. From there, teams can create a trace-derived metric that represents an end-to-end customer workflow, such as:
- “Transfer money from this account to that account.”
- “Apply for this loan.”
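A minimal sketch of deriving such a workflow metric from trace data: given the spans of one trace, the end-to-end latency is the gap between the earliest start and the latest end. The span shape (`start_ms`, `end_ms` keys) is an illustrative assumption; real pipelines would typically compute this in a span-to-metric processor rather than by hand.

```python
def workflow_duration_ms(spans):
    """Derive an end-to-end latency metric from the spans of one trace.

    `spans` is an illustrative list of dicts with 'start_ms' and
    'end_ms' keys; real systems would read these from trace data.
    """
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    return end - start
```

Recording this single number per trace gives you a customer-workflow metric (such as time to "transfer money" or "apply for this loan") without storing every underlying span long term.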
Whether you're sending to multiple vendors or to a mix of an in-house stack and vendors, ensure you can link data between systems, for example by adding the traceID to log lines or by adding a dashboard note with links to preformatted queries. That extra support lets your team perform better investigations and remediate issues faster.
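Adding the trace ID to log lines can be sketched with Python's standard `logging` module and a context variable, as below. The `current_trace_id` context variable is a stand-in assumption; in a real system the ID would come from your tracing library's active context.

```python
import contextvars
import logging

# Hypothetical context variable carrying the active trace ID; in a real
# system this would be populated from your tracing library's context.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the active trace ID into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True

# Include the trace ID in every formatted log line.
formatter = logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s %(message)s"
)
```

With the trace ID on every log line, an engineer can jump straight from a log entry to the matching trace in another system instead of re-deriving the link by hand.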
Explore Chronosphere’s Future-Proof Solution
Engineering time comes at a premium. The more you invest in high-fidelity insights and in helping engineers understand what telemetry is available, the more fearless instrumenting becomes, the faster troubleshooting gets, and the more future-proof and data-informed your team's decisions will be.
As companies transition to cloud native, uncontrollable costs and rampant data growth can stop your team from performing successfully and innovating. That's why cloud native demands reliable, future-proof observability. Take back control of your observability today, and learn how Chronosphere's solutions manage scale and meet modern business needs.