Essential Metrics to Monitor Serverless on Amazon Web Services
Thundra sponsored this post.
Troubleshooting serverless applications means tying together many different resources. Lambda functions run on demand, on hardware that exists only for the duration of a request, and their logs can be spread across multiple resources. The first line of defense in making a serverless application more maintainable is therefore getting a handle on the metrics that matter and what they mean. Aggregate analysis of metrics will often be the first signal you receive when problems occur in production, and knowing which metrics to track is half the battle. In this article, we’ll explore the serverless metrics that are critical to your application’s health.
Which Metrics Are Essential?
Let’s dive in by setting some boundaries for our discussion. In this exploration, we’ll focus on three general categories of metrics: operational, load-related, and business-specific. Each category represents a portion of your application’s execution process. To build a coherent view of your application’s production performance, you’ll need to take them all into account.
The easiest metrics to understand are operational metrics, which track the operational performance of your serverless functions by comparing the results of calls. This allows you to build thresholds for alerting, establishing the critical metrics that help you ensure your application continues to run without issue. In AWS Lambda, we primarily focus on two operational metrics: aggregate error count and aggregate execution count.
The aggregate error count is a simple count of the number of errors your Lambda function encounters over a period of time. While this can give us a general sense of how successful our functions are in production, it needs another piece of data, aggregate execution count, to give us a complete picture. Aggregate execution count is exactly what the name implies: the number of times a function is executed within a given timespan. By comparing the aggregate execution count to the aggregate error count, you can build a generalized picture of the reliability of your serverless application’s component functions.
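The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names and the 1% alert threshold are assumptions chosen for the example, and the counts are expected to come from whatever monitoring source you use.

```python
def error_rate(error_count: int, execution_count: int) -> float:
    """Return the fraction of executions that ended in an error."""
    if error_count < 0 or execution_count < 0:
        raise ValueError("counts must be non-negative")
    if execution_count == 0:
        # No executions in the window, so there is nothing to report.
        return 0.0
    return error_count / execution_count


def should_alert(error_count: int, execution_count: int,
                 threshold: float = 0.01) -> bool:
    """Hypothetical alert rule: fire when the error rate exceeds the threshold."""
    return error_rate(error_count, execution_count) > threshold
```

With these definitions, 5 errors in 1,000 executions is a 0.5% error rate and stays below the assumed threshold, while 20 errors in the same window would trigger the alert.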
Load-related metrics indicate the relative load placed upon the hardware running your code. In a traditional software application, this would include metrics like CPU usage, RAM usage, or network bandwidth usage. Serverless applications don’t maintain hardware statistics with the same fidelity as their traditional counterparts, but there are still a few load-related metrics we can look to for clarity.
The load-related metrics are centered on the duration of execution, which can be measured in several ways. The first is the average duration of execution: the average execution time, in milliseconds, for each of your application’s functions. Lambda records the duration of every invocation and reports it to CloudWatch, where it is aggregated. This tells us how long, on average, our serverless functions run when called, and it helps us gauge resource usage relative to the function timeout to which Lambda functions must adhere.
An average, though, can be misleading when you are trying to understand the aggregate performance of your functions. Because function duration can vary widely, a handful of slow calls can hide behind a healthy-looking mean, so it’s important to look at the distribution of your functions’ execution times. We do this using percentiles, such as p90 and p99. The p90 value is the time within which 90% of your function calls complete, and p99 extends this to 99% of your function calls. These values tell you how close you are getting to critical limits, such as the timeout value for your Lambda functions. If the p99 time for your serverless application is very close to the timeout value of your Lambda functions, for example, you are more likely to see timeout exceptions from your functions as they execute, which tells you where to start your investigation.
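To make the difference between an average and a percentile concrete, here is a small Python sketch. It uses the nearest-rank definition of a percentile, which is one of several common definitions and may differ slightly from what a given monitoring tool reports. The function names and the 90% "near timeout" ratio are assumptions made for illustration.

```python
import math
import statistics


def percentile(durations_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not durations_ms:
        raise ValueError("need at least one duration sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(durations_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def near_timeout(durations_ms, timeout_ms, pct=99, ratio=0.9):
    """Hypothetical check: is the chosen percentile within `ratio` of the timeout?"""
    return percentile(durations_ms, pct) >= ratio * timeout_ms


# 95 fast calls at 100 ms and 5 slow calls at 2900 ms, with a 3000 ms timeout.
samples = [100] * 95 + [2900] * 5
mean_ms = statistics.mean(samples)   # 240 ms: looks healthy
p90_ms = percentile(samples, 90)     # 100 ms
p99_ms = percentile(samples, 99)     # 2900 ms: close to the timeout
```

In this made-up sample the mean suggests plenty of headroom, while the p99 shows that the slowest calls are already brushing against the timeout.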
The next metric is the number of function throttles. When AWS Lambda functions begin to execute too often, or with too much concurrency, the rate at which they are called is throttled. This metric simply counts the number of throttles seen by your Lambda functions as they execute. The higher this number, the higher the risk that any individual call will be throttled due to excessive concurrency, system load, or any of the other potential factors.
The final load-related metrics are measures of infrastructure needs, like a function’s concurrency settings, and capital-driven metrics like execution cost. These metrics are directly driven by the load your serverless application places on the AWS Lambda infrastructure. A concurrency limit, whether the account-level limit or a per-function reserved concurrency setting, caps how many executions can run at once, and requests above that limit are subject to throttling by AWS. Provisioned concurrency is a related setting that keeps a fixed number of execution environments initialized so that requests avoid cold starts. Throttling and cold starts both slow down your functions’ responses, negatively impacting customer experience. On the financial side, capital-driven metrics like execution cost (which you can estimate from the billed duration and memory size that Lambda writes to each invocation’s REPORT log line) help you control your infrastructure spend. This is critical as your application starts to scale, because sudden growth in Lambda function execution requests translates directly into increases in your monthly AWS bill.
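As a rough sketch of how execution cost can be estimated, the Python below parses the REPORT line that Lambda writes to a function’s log at the end of each invocation and converts the billed duration and memory size into GB-seconds. The helper names are hypothetical, the price per GB-second is deliberately left as a caller-supplied parameter (check current AWS pricing for your region and architecture), and the estimate covers only the duration-based portion of the bill, not the per-request charge.

```python
import re

# Fields of interest in a Lambda REPORT log line.
_BILLED_RE = re.compile(r"Billed Duration:\s*(\d+(?:\.\d+)?)\s*ms")
_MEMORY_RE = re.compile(r"Memory Size:\s*(\d+)\s*MB")


def billed_gb_seconds(report_line: str) -> float:
    """Return the GB-seconds billed for one invocation's REPORT line."""
    billed = _BILLED_RE.search(report_line)
    memory = _MEMORY_RE.search(report_line)
    if billed is None or memory is None:
        raise ValueError("not a recognizable REPORT line")
    seconds = float(billed.group(1)) / 1000
    gigabytes = int(memory.group(1)) / 1024
    return seconds * gigabytes


def estimate_duration_cost(report_lines, price_per_gb_second: float) -> float:
    """Sum the duration-based cost over many invocations.

    price_per_gb_second is an assumption supplied by the caller.
    """
    total = sum(billed_gb_seconds(line) for line in report_lines)
    return total * price_per_gb_second


example = ("REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
           "Duration: 12.34 ms\tBilled Duration: 100 ms\t"
           "Memory Size: 128 MB\tMax Memory Used: 18 MB")
gb_s = billed_gb_seconds(example)  # 0.1 s * 0.125 GB
```

Summing this value over a day of invocations gives a quick view of how traffic growth feeds into spend before the bill arrives.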
The last set of metrics we’ll look at are specific to each application and the business that drives it. These business metrics are built using custom metrics that your developers add to your application, and can include any metrics that your developers are able to devise. These metrics use custom information built on top of your application’s feature set, providing meaningful measures of your application’s performance as it relates to your critical user-facing functionality.
The key is to identify the metrics most critical to your application’s health in terms that can be actively reported upon, measured, and monitored. For example, if you were writing a serverless payment processor, you might want to track the number of transactions recorded in your application per day so that you can get a feel for traffic spikes in your application. These metrics will be tied to the business goals of your application, and when coupled with automated infrastructure metrics they can help you identify problem areas in your application, stress points in your architecture, or business validation failures. Depending on your need, the sky’s the limit in terms of metrics that matter to your business.
Getting Started with Native Tools
Once you’ve settled on an applicable set of metrics for your serverless application, it’s time to build your toolchain. We’ll begin our analysis with the AWS-native tools available for monitoring Lambda functions in a serverless application. That list starts with CloudWatch, which provides a number of key metrics, viewable both at a per-service level and a cross-service level spanning multiple AWS resources. CloudWatch defines a number of highly valuable metrics for your serverless application right out of the box.
CloudWatch gives you access to hard statistics on your function, including invocation count, error count, average duration, and throttle count. CloudWatch also lets you build in custom metrics, with simple API calls populating CloudWatch with the metrics that are important to your application. CloudWatch Logs Insights lets you go further, querying structured log data to derive metrics from your application logs without extra application code. You can view these statistics for a single AWS region or across multiple regions. CloudWatch provides many of the crucial operational and load metrics that will drive your application’s health.
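One way to publish a custom metric from a Lambda function, alongside direct API calls, is CloudWatch’s embedded metric format: the function prints a JSON log line with an `_aws` metadata block, and CloudWatch extracts the metric from the log. The sketch below builds such a line using only the standard library. The helper name, the `PaymentProcessor` namespace, the `TransactionCount` metric, and the `Country` dimension are all hypothetical values chosen for this example.

```python
import json
import time


def emf_metric_line(namespace, metric_name, value, unit="Count",
                    dimensions=None, timestamp_ms=None):
    """Build one embedded-metric-format log line as a JSON string."""
    dimensions = dimensions or {}
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set, listing the dimension keys used below.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        # The metric value and dimension values live at the top level.
        metric_name: value,
    }
    record.update(dimensions)
    return json.dumps(record)


# Inside a Lambda handler, printing the line sends it to CloudWatch Logs.
line = emf_metric_line("PaymentProcessor", "TransactionCount", 1,
                       dimensions={"Country": "DE"},
                       timestamp_ms=1700000000000)
print(line)
```

Keeping the metric in the log stream means the function does not need to make a separate network call per data point, which is one reason this approach is often used in Lambda.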
Shortcomings of Native Tools
While native tools give us a powerful picture at a glance, there are a few holes that can reduce the utility of the metrics they generate. At their core, native tools like CloudWatch are limited in scope. They don’t let you build a cohesive and comprehensive view of your application — everything is conducted at the function level, or in an aggregate fashion that makes further analysis challenging. CloudWatch metrics in particular can be very limited if you’re looking to monitor behavior, as functions are presented irrespective of the control flow that invoked them. This leads to a complex user interface that you need to understand fully before you can navigate it, and it gives an incomplete view of the metrics that truly matter for your application.
Moving beyond Native Tools with Custom Business Metrics
Native tools give you a lot of information about current execution. With the operational and load metrics provided out of the box, you can build a general picture of your application’s current execution characteristics. However, incorporating metrics from outside this set is often critical to business success. Envision a payment processor serving multiple countries. You’ll need to track the transactions that take place in each country individually when provisioning server capacity. This information is specific to the business purpose of the application, and thus cannot be defined in terms of the metrics available to the AWS platform.
Once you’ve defined the metrics that matter to your business, you’re still tasked with finding a way to feed them into your monitoring dashboard. In a native-only approach, your metrics will be limited to the tools available within AWS, which speak more to the general characteristics of Lambda execution than to any particular custom measurement. Thundra lets you consolidate all of your resources into the Smart Dashboard, helping you get the metrics you care about running in production, complete with monitoring and alerting.
Building a Complete Picture with Thundra
Integrating a third-party tool can help to solve many of the pain points of native AWS monitoring tools. Thundra serves as an online dashboard for the metrics that matter to your application, incorporating the metrics generated by your serverless application in CloudWatch and building additional functionality on top of them. The searchable UI makes these metrics much more accessible, but the presentation only scratches the surface.
One critical Thundra feature is the AI-driven anomaly detection functionality in the Smart Dashboard. Through the automated alerts and insights generated in Thundra’s platform, you can gain deep visibility into your application’s vital metrics without the need for custom configuration. Thundra gives you the power to define the business metrics you need, combining all of the metrics reported by your functions into a single, easy-to-use dashboard.
Taking Your Application to the Next Level
Monitoring is crucial for detecting when things go wrong with serverless applications. While it’s fairly straightforward to come by operational and load-related metrics, native tools leave a lot on the table when it comes to collection and publication of business-specific metrics. Many of the metrics offered through CloudWatch can also miss the mark due to a lack of discoverability, searchability, or grouping granularity. With Thundra you can build on top of the robust support for operational and load metrics in AWS, adding monitoring and alerting using the Smart Dashboard, which helps you keep abreast of the metrics that matter in your application. With Thundra, you can bring all the information you need together in an easy-to-use interface that can drive your application to the next level.
Amazon Web Services is a sponsor of The New Stack.