DevOps and Error Monitoring: An Introduction to the CALMS Model
Gone are the days of waiting for customers to report software problems before they are fixed.
By its very nature, DevOps creates more visibility between teams and processes. Arguably, nowhere is this more relevant than in the discovery and resolution of crashes and performance problems that affect end users.
Error and performance monitoring platforms like Raygun fit naturally into a DevOps environment, aiding communication and reducing the risk of software errors reaching customers’ hands in the first place.
This article looks at how error reporting fits into DevOps processes through the lens of the CALMS model, arguably one of the most widely used frameworks for assessing a team’s readiness for DevOps.
Wait, What Is Error Monitoring?
Error and performance monitoring platforms exist to surface problems in production and beyond. They aim to make it easier for developers to build great quality software by providing insights into the cause of software errors and performance problems.
Usually, error monitoring platforms include features like:
- Error summary pages: to provide diagnostic details (like the stack trace),
- Error grouping and filtering: to consolidate error occurrences,
- Deployment tracking: to correlate errors with deployments,
- User tracking: to identify who is experiencing errors,
- Built-in Real User Monitoring: to provide data on the user experience at a code level.
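As a rough illustration of the grouping and user tracking features listed above, here is a minimal sketch in Python. The `ErrorOccurrence` fields and the fingerprint rule (exception type plus top stack frame) are assumptions made for this example, not a description of how Raygun or any other platform actually groups errors.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorOccurrence:
    """One reported error. All field names are hypothetical."""
    exception_type: str
    top_frame: str        # e.g. "cart.py:10 in total"
    user_id: str
    deploy_version: str


def fingerprint(occurrence):
    # Assumed grouping rule: same exception type raised from the same frame.
    return (occurrence.exception_type, occurrence.top_frame)


def group_errors(occurrences):
    """Consolidate individual occurrences into groups keyed by fingerprint."""
    groups = defaultdict(list)
    for occurrence in occurrences:
        groups[fingerprint(occurrence)].append(occurrence)
    return dict(groups)


def summarize(groups):
    """Per group: occurrence count, distinct users, and versions it appeared in."""
    summary = {}
    for key, items in groups.items():
        summary[key] = {
            "count": len(items),
            "users_affected": len({o.user_id for o in items}),
            "versions": sorted({o.deploy_version for o in items}),
        }
    return summary
```

Even a simple summary like this answers three questions at once: how often an error occurs, how many people it reaches, and which deployment introduced it.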
DevOps and error monitoring meet in providing visibility into the cause and effect of a problem, so it can be resolved with greater speed and accuracy. Crash reporting and real user monitoring use data visualization to present a wealth of data in a way that’s relevant and digestible to all parties:
- DevOps managers can spot trends in date and time graphs,
- Developers get the diagnostic details (like the stack trace) to resolve the error,
- Managers can assess the business cost from a detailed report of how many customers were affected, and everyone can put preventative measures in place.
Let’s look into where error monitoring fits with one of the most popular conceptual DevOps frameworks: the CALMS model.
Why the CALMS Model?
The CALMS model is a conceptual framework used to assess a company’s readiness to adopt a DevOps process. CALMS stands for Collaboration (more commonly given as Culture), Automation, Lean, Measurement and Sharing.
Using this framework, we can see where error monitoring is the most useful to both DevOps developers and leaders.
Collaboration (or Culture)
DevOps creates and encourages an environment where engineers are responsible for QA, writing and running their own tests to get their code out to customers.
There are no dedicated incident response teams; instead, developers and operations must work together to discover, triage and fix errors before they cause real business problems.
An obstacle to a fast resolution is monitoring in isolation rather than in one consolidated place. For example, developers look at logs, and managers might look in analytics software for the impact, creating data patches and holes where part of the story is missing.
If everyone worked from one source of data, issues would be resolved more quickly. Error monitoring tools and processes provide this without the need to bounce between different software, supporting a collaborative environment and shared responsibility.
Automation
Allowing for automation is really about creating reliable and efficient systems. Repetitive manual work costs a lot of money, and developers would rather build cool features than chase down software bugs in error logs.
Developers can spend a staggering 75 percent of their time searching for errors and performance problems using logs and vague customer reports. An error monitoring and real user monitoring platform automates that process with smart alerting via ChatOps or email summaries. There’s no need for a developer to spend time hunting for the cause of errors, because they can be triaged and assessed at a glance.
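To make the idea of smart alerting more concrete, here is a minimal sketch of an alert rule in Python. The policy (notify on any new error group, or when an existing group reaches a threshold of affected users), the `GroupStats` fields, and the message format are all assumptions for illustration; a real platform and chat tool will have their own rules and payloads.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Rolling stats for one error group. All field names are hypothetical."""
    title: str
    is_new: bool           # first seen since the last deployment
    users_affected: int
    occurrences: int


def should_alert(stats, user_threshold=25):
    # Assumed policy: always notify on a new error group, and notify on an
    # existing group only once it has spread to enough users.
    return stats.is_new or stats.users_affected >= user_threshold


def build_chat_message(stats):
    """Build a generic chat payload; the 'text' key is an assumed format."""
    label = "New error" if stats.is_new else "Spreading error"
    return {
        "text": f"{label}: {stats.title} "
                f"({stats.occurrences} occurrences, "
                f"{stats.users_affected} users affected)"
    }


def alerts_for(groups, user_threshold=25):
    """Return one message per group that passes the alert rule."""
    return [build_chat_message(g) for g in groups
            if should_alert(g, user_threshold)]
```

Note that under this rule a noisy error hitting only a handful of users stays quiet, while a brand-new error is surfaced after its first occurrence. That is the kind of triage a developer would otherwise do by hand in the logs.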
Lean
When we talk about lean, we often think of Agile processes allowing software teams to deploy quickly and imperfectly. A lean process comes without the fat of extra features that make little to no difference to the customer experience.
A few years ago, software teams discovered it’s much better to launch a product into the customer’s hands today than it is to wait for another six months for it to be perfect.
Error reporting is an essential part of this continuous delivery process, because errors are detected as soon as they occur in production, ideally before they reach most customers. Just because the first iteration of a product is minimal doesn’t mean it has to be buggy.
Measurement
Designing a software team’s KPIs is more about what not to measure than what can be measured. There are so many tools providing metrics and reports that it can be hard for teams to focus on the numbers that will move the needle.
Agile and lean concepts demand that teams only look at a few metrics. For example, in our software team, we only hold people accountable to four metrics:
- Users affected by bugs: We see total error counts as a misdirection. If a team has 10,000 errors that affect one customer, it’s not as bad as 500 errors affecting 250 customers. This metric makes it easy to prioritize our users.
- Median Application Response Time: Half of customers experience the median response time or faster. Performance makes money: 40 percent of users will leave a website that takes more than three seconds to load!
- P99 Application Response Time: Medians are great, but we also need to appreciate the upper limit. We track the P99, the response time that 99 percent of users experience or better. We don’t worry about it being slow; it will be. We just want “slow” to mean five seconds, not a whole minute. (We don’t often track P100, as it’s the domain of total timeouts and bad bots that hold connections open, so it’s not an accurate representation of real users.)
- Resolved bugs vs. new bugs: Platforms like Raygun will group bugs by their root cause. Ideally, a team should be fixing bugs as quickly as they are creating them.
These four metrics serve to improve the user experience by putting it first (because at the end of the day, customers are our bread and butter) and to reduce technical debt (which frees up our team to work on features, not remedial work).
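The four metrics are simple enough to compute by hand, and the arithmetic can be sketched in a few lines of Python. The event format (a list of `(user_id, error_id)` pairs), the function names, and the sample response times are assumptions for this example. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from statistics import median


def users_affected(error_events):
    """Count distinct users across (user_id, error_id) events."""
    return len({user_id for user_id, _ in error_events})


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def resolution_ratio(resolved_count, new_count):
    """Resolved bugs per new bug; 1.0 or more means the team keeps pace."""
    return float("inf") if new_count == 0 else resolved_count / new_count


# The example from the list above: error counts alone are misleading.
team_a = [("u1", i) for i in range(10_000)]          # 10,000 errors, 1 user
team_b = [(f"u{i % 250}", i) for i in range(500)]    # 500 errors, 250 users
affected_a = users_affected(team_a)                  # 1
affected_b = users_affected(team_b)                  # 250

# Made-up response times in seconds: mostly fast, one slow, one timeout.
response_times = [0.2] * 98 + [4.8, 60.0]
p50 = median(response_times)                         # 0.2
p99 = percentile(response_times, 99)                 # 4.8
p100 = percentile(response_times, 100)               # 60.0, the timeout
```

In this made-up sample, the P99 sits at an acceptable 4.8 seconds while the P100 is dominated by a single 60-second timeout, which is why the P100 says little about real users.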
Sharing
DevOps exists to bridge the gap between developers and operations teams. A big part of that process is to ensure shared responsibility.
Although there is a culture of “if you build it, you’re responsible for it” in DevOps, this doesn’t mean each developer has to be an operations expert, too. Ideally, DevOps teams should share the responsibility of issues discovered in code.
Developers on the support lines are most effective when paired with operations in each phase of the development lifecycle. For example, the developer will be responsible for addressing issues caught by end users, and also troubleshooting code in production.
Error monitoring aims to save time in this phase in particular, as the exact causes of issues and performance problems can be elusive. Instead of searching for the error and its impact, developers can see the problematic line of code on a page. As the developer works through the backlog, issues are cleared much faster, and integrations with ChatOps and issue trackers give the operations team a faster feedback loop. Shared responsibility naturally becomes easier when communication lines are clear and not fogged with incorrect or irrelevant data.
DevOps success relies on software tools and processes, but ultimately, it’s about enabling people and culture. In this article, I’ve discussed where error monitoring fits in with the DevOps process, and how simple tools can support visibility and communication.