Why You Must Keep Error Monitoring Close to Your Code
Rare is the DevOps team with the bandwidth to manually parse and prioritize fixes among what can number in the hundreds, if not millions, of application-error alerts. That work includes distinguishing minor glitches from the errors that can bring an organization's capacity to meet its customers' needs and expectations to a screeching halt.
A viable error-monitoring system should, ideally, automate the communication of error data in a way that indicates what must be done to make a fix.
The error alerts users receive must be "actionable," said Ben Vinegar, vice president of engineering for error-tracking software company Sentry. "That's also a really hard problem" to solve, Vinegar said.
In other words, error monitoring must also be code-centric.
In this episode of The New Stack Makers podcast, Vinegar discusses what error monitoring means today and how, among the different types of monitoring, detecting errors has emerged as a critical capability for organizations.
In the early days of error monitoring, a software product might have indicated "'hey, you're running a web server that is broken and it's crashing,'" Vinegar said. A more advanced approach might consist of blindly rummaging through a web server log file to see "if some sort of crash dump or stack trace was there," he said. "A text file doesn't really email you or send you a message when something is wrong," Vinegar said. "You have to be really proactive and rigorous in periodically checking it."
Sentry’s offering reflects how error monitoring has “really expanded as a field,” Vinegar said. “Sentry is an open source, cross-project, error monitoring server stack, and it works really for every platform,” Vinegar said. “In those early days [error monitoring] was very platform-specific.”
Error monitoring should continue to evolve as well. Sentry, for example, is developing capabilities that extend far beyond simply absorbing error data to offer more precise statistical analysis. "An error can happen 100 times, it can happen 10,000 times, it can happen a million times. But depending on the percentage of what your overall traffic shape is for that endpoint, page or whatever, you don't really know whether that's bad or not," Vinegar said. "A million errors sounds really bad, but if you're processing a billion transactions of an endpoint, maybe that's not so bad. If 100 errors happen to 100 attempts at doing a checkout flow on an e-commerce site, that's really bad and that's gonna harm your business — so, being able to distinguish, relative to your overall traffic patterns is like a sort of a crucial next step and what we're working on right now."
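The idea Vinegar describes can be sketched in a few lines: judge an endpoint by its error rate relative to traffic, not by its raw error count. This is a minimal illustration, not Sentry's implementation; the function names and the 1% threshold are assumptions chosen for the example.

```python
def error_rate(errors: int, transactions: int) -> float:
    """Fraction of an endpoint's transactions that ended in error."""
    if transactions == 0:
        return 0.0
    return errors / transactions


def is_alarming(errors: int, transactions: int, threshold: float = 0.01) -> bool:
    """Flag an endpoint only when its error *rate* crosses the threshold,
    not when its raw error *count* merely looks large."""
    return error_rate(errors, transactions) > threshold


# A million errors over a billion transactions is a 0.1% rate: below threshold.
print(is_alarming(1_000_000, 1_000_000_000))  # False
# 100 errors in 100 checkout attempts is a 100% failure rate: clearly alarming.
print(is_alarming(100, 100))  # True
```

Seen this way, the same count of a million errors can be routine noise on one endpoint and a business-critical outage on another, which is exactly why the traffic context matters.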