Raygun APM Mixes Traditional Monitoring with User-Centric Observability
How do you manage complex self-orchestrating clusters of microservices without overloading your DevOps team with information? With modern applications built on new architectures, for every old problem we solve, we find new ones. That’s where a new generation of Application Performance Monitoring (APM) tools comes into play, mixing a customer-centric focus on application observability with deep application integration to get to the root cause of user issues quickly.
New entrants in the APM market are employing the concepts of observability to better manage and provide insight into this complexity. Observability with APM means understanding the depth of the data, uncovering the underlying reasons for the abnormal patterns that may be observed. And it encourages a developer-friendly approach — making the data accessible and easily understood by developers who are increasingly involved in monitoring and operational tasks.
This new generation of monitoring and observability tools, built for the hybrid, multi-cloud era, links what full stack developers know about apps and their underlying infrastructure with the resulting customer experience, helping teams make better operational decisions and prioritize what needs to be done and when. Such tools give visibility and insight into a stack that stretches from end-user to line-of-business software, working across clouds and built on microservices.
New Market Entrants
The APM market is growing, with tools from a wide range of vendors, including development giants like IBM, Microsoft and Oracle and market leaders such as CA Technologies, Dynatrace and New Relic. Gartner predicts that by 2021 enterprises will monitor 20 percent of their applications with APM tools, compared with 5 percent in 2017. There’s also been consolidation, with Cisco’s 2017 acquisition of AppDynamics.
But there’s a lot of complexity in APM, from adding monitoring code to your apps, to interpreting the commonly used performance waterfall charts. Traditional APM tools provide a lot of information, but it’s tailored to the needs of operations teams rather than developers, making it difficult to go from reports to refactored code.
Monitoring tools have now reached a point of inflection where we start to move from measurement and monitoring to understanding the customer impact of events we observe in complex, self-orchestrating systems. Honeycomb.io, for example, doesn’t talk about measurement or monitoring, but instead puts its resources into explaining observability and discussing its context. Blue Medora believes that data is dimensional, not flat, and that performance metrics must include context to truly provide insights through what it describes as an IT monitoring integration service. Raygun’s Application Performance Monitoring (APM) for .NET applications (with further language support coming soon) takes a hybrid approach to handling issue reporting, mixing the familiarity of traditional monitoring tools with modern observability practices.
“We analyze the trace data as it comes in and we then build up an issues list,” John-Daniel Trask, CEO of Raygun, said. Instead of generating a long list of problems, Raygun APM takes those thousands of instances of an error or an underlying performance issue, and focuses on the underlying root cause down to code level.
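To make the idea of an issues list concrete, here is a minimal sketch of collapsing many trace records into one entry per underlying cause. It is an illustration only, not Raygun's implementation: the record fields (`method`, `kind`, `user`), the fingerprint rule and the `build_issue_list` function are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    fingerprint: tuple
    count: int = 0
    affected_users: set = field(default_factory=set)


def build_issue_list(traces):
    """Collapse individual trace records into one issue per fingerprint."""
    issues = {}
    for trace in traces:
        # Hypothetical fingerprint: the offending method plus the kind of problem.
        key = (trace["method"], trace["kind"])
        issue = issues.setdefault(key, Issue(fingerprint=key))
        issue.count += 1
        issue.affected_users.add(trace["user"])
    # Issues touching the most users come first, then the most frequent.
    return sorted(
        issues.values(),
        key=lambda issue: (len(issue.affected_users), issue.count),
        reverse=True,
    )


# Made-up sample data: three slow checkouts and one search exception.
traces = [
    {"method": "Cart.Checkout", "kind": "slow_query", "user": "u1"},
    {"method": "Cart.Checkout", "kind": "slow_query", "user": "u2"},
    {"method": "Cart.Checkout", "kind": "slow_query", "user": "u1"},
    {"method": "Search.Run", "kind": "exception", "user": "u3"},
]
issue_list = build_issue_list(traces)
```

Four raw records become two issues, with the checkout problem ranked first because it affects more users.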
APM’s roots are in instrumentation that provides alerts. Alerts are useful as long as they don’t overload a system and start creating problems of their own, such as false positives. This approach provides views of the overall health of a system and why it is performing in the manner that it is.
“We saw APM as a lot of charts and people having to build up different alerts and things like that and we thought, ‘You know, there’s got to be a smarter way’,” Trask said. The result is an inbox-based approach that surfaces issues that can be quickly improved. Once an issue has been marked as resolved it will alert developers only if it returns.
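The resolve-then-alert-on-return behavior can be sketched as a small state machine. This is a hedged illustration rather than Raygun's actual logic: the `IssueInbox` class, its status names, and the choice to also notify on an issue's first appearance are assumptions introduced here.

```python
class IssueInbox:
    """Tracks issue status and decides whether an occurrence warrants an alert."""

    def __init__(self):
        self.status = {}  # fingerprint -> "active" or "resolved"

    def record(self, fingerprint):
        """Return "new", "regressed", or None for a repeat of a known active issue."""
        previous = self.status.get(fingerprint)
        if previous is None:
            # Assumption: a never-before-seen issue is surfaced once.
            self.status[fingerprint] = "active"
            return "new"
        if previous == "resolved":
            # The issue came back after being marked as resolved.
            self.status[fingerprint] = "active"
            return "regressed"
        return None  # Already in the inbox; no further noise.

    def resolve(self, fingerprint):
        self.status[fingerprint] = "resolved"


inbox = IssueInbox()
alerts = [inbox.record("checkout-slow"), inbox.record("checkout-slow")]
inbox.resolve("checkout-slow")
alerts.append(inbox.record("checkout-slow"))
```

Repeat occurrences of an open issue stay quiet, which is the point of an inbox over a stream of alerts.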
Observability provides a deep view into your code, highlights bottlenecks and other effects on performance, and gives an end-to-end view of performance from the user’s point of view. Systems engineer and author Cindy Sridharan’s description states “observability is a superset of monitoring, combining alerting/visualization, distributed systems tracing infrastructure and log aggregation/analytics to provide better visibility into IT systems health.”
Building an APM with a User Focus
Building on a background in error reporting and diagnostics, Raygun added real user monitoring in 2015. Trask describes error reporting as “like having a black box flight recorder,” which is good for knowing why something happened to your software. Real user monitoring is very different. As Trask says, “It’s like trying to understand airline ratings, who do I actually want to fly with? Who is giving me great performance?”
Focusing on customers makes a lot of sense. If a user fails to complete a transaction, is it a problem with the design of the application, its performance, or a problem with a back-end service? What matters is not how the overall service is performing, but whether it’s performing in a way that supports users and the business. A failed transaction doesn’t mean that an operation has failed; it means that you have lost a sale, and a disappointed customer is unlikely to return. When it’s only a small percentage of transactions that fail, you might not prioritize fixing the underlying problem. But when that small percentage is linked to large shopping carts, or specific users in a specific country, there’s a significant upside in fixing the problem.
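A rough sketch shows why the overall failure rate can mislead. The transaction fields, the sample figures and the `lost_revenue_by_segment` helper below are all invented for illustration and do not come from any real tool or dataset.

```python
from collections import defaultdict


def lost_revenue_by_segment(transactions, segment_key):
    """Sum the cart value of failed transactions per segment, largest loss first."""
    totals = defaultdict(float)
    for tx in transactions:
        if not tx["succeeded"]:
            totals[tx[segment_key]] += tx["cart_value"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample: eight small successful orders and two failures.
transactions = [{"country": "NZ", "cart_value": 20.0, "succeeded": True}] * 8 + [
    {"country": "DE", "cart_value": 900.0, "succeeded": False},
    {"country": "US", "cart_value": 30.0, "succeeded": False},
]

failure_rate = sum(not tx["succeeded"] for tx in transactions) / len(transactions)
ranking = lost_revenue_by_segment(transactions, "country")
```

The failure rate alone says little, while the ranking shows that nearly all of the lost value sits in one segment, which is the kind of context that changes what gets fixed first.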
That focus led to Raygun’s next step, the development of an APM tool, launched in late 2018. Customers were asking for something that could add to the existing Raygun tooling, building on how it reported software problems. Hyperfish, which helps enterprise companies identify and fix out-of-date data in their directories, is one of those customers.
“Having that telemetry and those tools available at our engineers’ fingertips is really critical. When an alarm goes off in the middle of the night, we need to know very quickly if it’s something important or if it’s something that can wait. And if it is important, what’s going on in the software. We’re able to do that because of tools like Raygun,” said Chris Johnson, chief technology officer and co-founder of Hyperfish.
Initial support was for .NET and Azure App Service, with .NET Core, Ruby, and Java support following throughout 2019. It’s a set of platforms that should cover most common modern development use cases.
Taking a customer-centric approach to APM is important, as you’re working to solve customer problems. If a process is blocking 15 percent of the time in a shopping cart, you could be losing significant sales over time. Building this information into APM is critical: it allows you to prioritize issues and to focus on what makes a service work well for its users. Trask describes this as helping tools deliver value:
“Raygun will say, ‘Here’s the number of users that actually had an issue or had a bad time,’ but going further, they can even attach their metadata about who those users are. And so they can say, ‘Look, here are our customers that actually had the issue’,” Trask said.
APM Plus Observability
It’s an approach that mixes familiar measurement and monitoring tooling with a modern observability paradigm, one in which the health of individual events or a customer’s experience is just as important as, or even more important than, the health of the system. This approach is at the heart of Raygun’s APM tooling, providing a way of understanding why something happened and what can be done to reduce the risk of it happening again. Mixing traditional monitoring with observability makes it easier to bring new approaches into developer and operations workflows.
Other APM vendors continue to focus on measuring performance and finding bottlenecks in code, rather than understanding just what effects those issues have on user experience. Choosing to double down on observability is key to delivering an effective operations strategy for modern distributed applications. While tools like AppDynamics and New Relic are starting to add support for microservice monitoring, they still have a focus on untangling and managing monolithic applications.
The key to delivering effective APM is collecting and managing large amounts of data, and delivering it to all of the parts of the business that need it. Steven Elliot, vice president of the IT infrastructure and cloud practice at IDC, talks of it as “a huge source of valuable real-time operational data, […] a heads-up display for how businesses are being managed.” It’s a model that maps well to Raygun’s customer-centric approach, and to its intention to use machine-learning techniques to prioritize data in future releases.
Giving developers tools to understand the effects of problems in their code also starts to break down barriers inside organizations. That’s important when developers need to work with the business owners of their product. Being able to put a name to a customer who’s had problems can help the business protect crucial accounts. Similarly, if sales teams get reports of concerns from customers, developers can drill down into the records of specific transactions to understand possible performance issues.
Raygun’s Roadmap to a Modern Toolkit
Bringing together APM, crash reporting and real user monitoring in an observability and performance toolkit isn’t going to happen instantly, but it’s Trask’s aim for the product suite.
“We’ve sort of taken that philosophy across the board with our products to say, rather than thinking just about the engineer as our customer, think about how they’re using our software to make their software better for their customers,” Trask said.
Another key difference between Raygun’s APM tooling and other APM packages is its pricing. Generally, APM tools are priced on a per-server basis. That approach works well for traditional applications designed for a fixed infrastructure, with a fixed number of servers. However, it causes problems with cloud native architectures, which scale automatically, deploying new container instances on demand. With thousands of instances running at a time, traditional billing models could add significantly to costs. Instead, Raygun uses a per-trace billing model to reduce costs. As IDC’s Elliot notes, “Ultimately, the goal of analytics is to enhance and accelerate problem identification and resolution.” Making data affordable can only help deliver on that promise.
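The difference between the two billing models comes down to simple arithmetic. The sketch below uses invented prices in cents purely for illustration. They are not Raygun's or any other vendor's actual rates, and both function names are hypothetical.

```python
def per_server_cost(instance_count, cents_per_instance):
    """Bill grows with the number of running instances."""
    return instance_count * cents_per_instance


def per_trace_cost(trace_count, cents_per_thousand_traces):
    """Bill grows with the volume of traces collected, not with instance count."""
    return trace_count * cents_per_thousand_traces // 1000


# Assumed figures: the same monthly traffic served by 10 fixed servers
# or by 1,000 autoscaled container instances.
TRACES = 1_000_000
CENTS_PER_INSTANCE = 5_000       # hypothetical price
CENTS_PER_THOUSAND_TRACES = 10   # hypothetical price

fixed_fleet = per_server_cost(10, CENTS_PER_INSTANCE)
autoscaled_fleet = per_server_cost(1_000, CENTS_PER_INSTANCE)
trace_bill = per_trace_cost(TRACES, CENTS_PER_THOUSAND_TRACES)
```

Under these assumptions, the per-server bill rises a hundredfold when the same traffic is spread over more instances, while the per-trace bill does not change. Which model costs less in practice depends on real prices and trace volumes.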
The same model can be used to support serverless tooling, with an extension in the Azure Marketplace to handle autoscaling in Azure websites. Integration is a key element of Raygun’s tooling, with links into common developer tools and platforms such as Jira and Slack. You’re able to go straight from an error to the relevant method, via your source control environment. Trask sees this as a key area for future development, as well as integration with other sources of application metrics.
“We’ll also be opening up more APIs to allow Raygun’s data to go into other systems,” he said. “We already have some of those APIs, but broadening the availability of it is certainly going to be a win. And we’ll keep an eye on things like the Functions as a Service offering to see whether we can bolster the APM story on those.”
Raygun’s tooling is a promising new entrant in a fast-growing market. A cloud-friendly pricing model and a customer-centric approach to understanding performance issues go well together, especially when managing and monitoring typical cloud applications. Integration with common developer tools should also help with uptake, both in developer and operations communities, as well as with business teams looking for a deeper understanding of their applications and their users.
In this podcast, Chris Johnson, co-founder and CTO of Hyperfish, explains how they approach cloud native monitoring with Raygun.