Serverless Analytics: Metrics, Collection and Visibility
Analytics in serverless systems comprise three components:
- What data and metrics to collect
- How to collect data
- How to interpret and use the data.
Let’s take a look at all three.
What Data to Collect in Serverless Systems
Nate Taggart, CEO and co-founder of the serverless management software company Stackery, said there are three types of metrics that need to be collected when managing serverless applications and architecture.
The first set of data collected in serverless systems reflects the function-as-a-service nature of the architecture. Here, the idea is to check the performance of individual functions. “For example, how long did the function take to run,” said Taggart. “Some of that you get by default in AWS CloudWatch, but there are also a number of Application Performance Management (APM) vendors like IOpipe who are able to assist with creating more granular data.”
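As a rough illustration of the per-function timing Taggart describes, here is a minimal Python sketch of a decorator that records how long each invocation takes. The `DURATIONS` list, the `timed` decorator and the `resize_image` handler are hypothetical names invented for this example, not part of any vendor's tooling; a real deployment would ship these records to a metrics backend rather than keep them in memory.

```python
import functools
import time

# Hypothetical in-memory sink, assumed for illustration only.
DURATIONS = []


def timed(handler):
    """Record the wall-clock duration of every invocation of `handler`."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        start = time.perf_counter()
        try:
            return handler(event, context)
        finally:
            # Runs whether the handler returned or raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            DURATIONS.append(
                {"function": handler.__name__, "duration_ms": elapsed_ms}
            )
    return wrapper


@timed
def resize_image(event, context=None):
    # Placeholder for real work done by the function.
    return {"status": "ok", "width": event["width"] // 2}
```

Calling `resize_image({"width": 800})` returns the handler's normal result and appends one duration record as a side effect, so the timing does not change what the function returns.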
Taggart explains that a serverless application is more than just a set of running functions. More important, perhaps, is taking a step back and seeing how data flows through the application as a whole. After all, in a serverless system, an event may trigger a function that carries out some sort of data transformation, and the result is then used to trigger another action, and so on through a complete workflow.
“It helps you to tie an event that happens at one service tier to another service tier. You need to be able to map the data flow for a serverless system all the way through. X-Ray from AWS does offer some of that, but for tracing analytics, Epsagon is the company that comes to mind for that,” said Taggart.
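One common way to tie events together across tiers is to pass a shared identifier along with the data. The Python sketch below assumes each function receives a dictionary event and forwards a `trace_id` field; the field name, the `TRACE_LOG` collector and the `ingest` and `transform` functions are all hypothetical, and this is not how X-Ray or Epsagon work internally.

```python
import uuid

# Hypothetical collector standing in for a tracing backend.
TRACE_LOG = []


def with_trace(event):
    """Return a copy of `event` carrying a trace_id, reusing one if present."""
    traced = dict(event)
    traced.setdefault("trace_id", str(uuid.uuid4()))
    return traced


def record(step, event):
    """Note that `step` handled the event identified by its trace_id."""
    TRACE_LOG.append({"trace_id": event["trace_id"], "step": step})


def ingest(event):
    event = with_trace(event)
    record("ingest", event)
    # Forward the trace_id so the next tier can be linked to this one.
    return {"trace_id": event["trace_id"], "payload": event["payload"].strip()}


def transform(event):
    event = with_trace(event)
    record("transform", event)
    return {"trace_id": event["trace_id"], "payload": event["payload"].upper()}


def run_workflow(payload):
    """Chain the two functions the way one event might trigger the next."""
    return transform(ingest({"payload": payload}))
```

After `run_workflow` runs, every entry in `TRACE_LOG` for that request shares one `trace_id`, which is what makes it possible to reconstruct the path the data took.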
In addition, there are error metrics to measure. These can take two forms:
- Runtime errors occur when a bug in your code causes the serverless app to break or fail at a certain point in its execution.
- Infrastructure-side errors occur when there is an underlying error on the side of AWS or another serverless provider. This may be because a function timeout has been set too short, because there is an invocation error, or because a Lambda just fails to run.
At this year’s GlueCon conference, Taggart presented on resolving errors in serverless systems, advocating for the need to work towards self-healing systems that not only identify when errors occur but can then start fixing themselves in production.
One example of an infrastructure-side error would be concurrency limits. Taggart gives the example of a serverless application being throttled because it needs more concurrently running Lambda instances than the owner’s AWS account allows. In those cases, the function will not be invoked for the event and, if the event is unbuffered, it will not be retried and will not be visible in logs beyond CloudWatch metrics.
How to Collect Data in Serverless Systems: Logging
“Logging is unique in serverless,” said Taggart. “You do get some visibility in your serverless application, but you are going to want to surface meaningful log information. As a software development industry, we are used to stateful resources where you can always connect to a server and ask what are the logs on the server. Then, you could ship them to something like CloudWatch, or Loggly and review from there. But with serverless, if you don’t instrument your system to collect logs, after the function runs, you have no way of collecting data on what happened.”
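To make the point about instrumenting functions concrete, here is a minimal Python sketch that wraps a handler so it writes one JSON line at start, end and on error. The `make_logger` and `logged` helpers and the `phase` field are assumptions made for this example; the only platform behavior relied on is that text a function writes to standard output can be captured by the provider's log service.

```python
import json
import logging
import sys


def make_logger(name, stream=None):
    """Build a logger that writes bare messages to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def logged(logger):
    """Decorator factory: emit structured start/end/error records for a handler."""
    def decorate(handler):
        def wrapper(event, context=None):
            name = handler.__name__
            logger.info(json.dumps({"function": name, "phase": "start"}))
            try:
                result = handler(event, context)
            except Exception as exc:
                # Log the failure, then re-raise so the platform still sees it.
                logger.error(json.dumps(
                    {"function": name, "phase": "error", "error": repr(exc)}
                ))
                raise
            logger.info(json.dumps({"function": name, "phase": "end"}))
            return result
        return wrapper
    return decorate
```

Because each record is a single JSON object per line, a downstream log tool can filter by `function` or `phase` without parsing free-form text.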
Taggart said that serverless demands that metric collection be planned during the development cycle. While this may often be an aspirational best practice elsewhere, with serverless it becomes a required one. Other serverless industry leaders, like Charity Majors from Honeycomb, also advocate for designing instrumentation during the development cycle in order to be able to continue testing serverless applications in production environments.
How to Interpret and Use Data in Serverless Systems: Visibility
Taggart believes that all of this work around analytics is due to a key shift in the organizational arrangements that come with serverless and software design of distributed applications at scale.
He explained: “All of this analytics work — identifying the metrics, collection, and ongoing logging — is based around one idea: We are asking devs to do operations work. From outside the industry, what developers do might look like magic. Devs write some code and application has some great new feature in it. But in reality, engineers from Ops and engineers from application development are separate disciplines, like heart and brain surgeons. So the shift here is from having 10 Ops engineers having access to the cloud accounts, to having hundreds of devs having access. That changes your organization. If engineers are responsible, they need the data and it needs to be contextualized. In the past, with a monolithic app, you had one set of logs. In serverless, you have distributed apps, with hundreds of functions each creating their own log systems. How do you correlate that across your whole architecture?”
Taggart gives an example of having an API backed by a Lambda function. If the API in a serverless workflow is getting too much traffic, or if the API consumer has hit their service limits, the Lambda function never gets invoked. So if a developer is watching the logs on Lambda, they don’t see any problem. Devs with application development skills might not have that ops skill of seeing data in context of the overall architecture, Taggart said, which is why Stackery has built a user interface to help devs visualize their data and serverless systems in a way that helps them simplify their learning curve.
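The gap Taggart describes can be sketched as a comparison between two log sources. The Python example below assumes, hypothetically, that both the API tier and the function tier record a shared `request_id`, and that the API tier also records a `status` code; neither field name comes from a specific AWS log format.

```python
def find_uninvoked(api_log, function_log):
    """Return API-tier requests that have no matching function invocation.

    Both arguments are lists of dicts. The `request_id` and `status` keys
    are assumed field names for this sketch.
    """
    invoked = {entry["request_id"] for entry in function_log}
    return [
        {"request_id": entry["request_id"], "api_status": entry["status"]}
        for entry in api_log
        if entry["request_id"] not in invoked
    ]


# Sample data: request "b" was rejected at the API tier (HTTP 429,
# Too Many Requests), so the function never ran for it.
api_log = [
    {"request_id": "a", "status": 200},
    {"request_id": "b", "status": 429},
    {"request_id": "c", "status": 200},
]
function_log = [{"request_id": "a"}, {"request_id": "c"}]

missing = find_uninvoked(api_log, function_log)
```

A developer looking only at `function_log` would see two clean invocations; the rejected request only shows up when the two tiers are read side by side.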
“The visibility component of analytics is hugely important,” said Taggart. “Devs need access to the logs, they need to visualize that in the context of the architecture they are building. The dev should have access to that information and in a lot of ways that is new. This is the real DevOps that we as an industry have been talking about for a decade. Serverless enforces DevOps because devs have to instrument so that observability can occur at runtime.”
All of this points to some underlying challenges for companies that manage distributed applications at scale, practice DevOps, run microservices, and are now introducing serverless architectures into their systems.
In the past, a VP of engineering may have been able to go to their ops team and ask for data on why an application was underperforming or broken, or why infrastructure costs were higher than budgeted. Now that ops team has been replaced by serverless and cloud engineering, which pushes the onus onto the developer teams more generally. So instead of 10 ops specialists, there are now 200 developers working under a vice president of engineering and each one is responsible for collecting the data on the elements of the system they are working on. How does an organization ensure consistency and auditability of traces and errors in that environment?
In turn, this challenge also surfaces a more fundamental issue: organizational structure itself. “We talk a lot about scalability for serverless, and when we talk about it, we are talking about the infrastructure, really we are asking can it handle a lot of requests? What we should be talking about is how do we manage scalability of the organization. How do you make sure teams are in place to manage growth? You need some kind of governance. There is a reason why we have standards in our legacy infrastructure, and the idea that we don’t need that same level of governance in serverless is not true. If anything we need more because we are opening the doors of our systems to more engineers to manage,” said Taggart.
Towards Standardization and Automation for Serverless Analytics
Taggart says that improving analytics depends on standardization and automation. He advocates for using serverless as an opportunity to build self-healing applications, which means having policies and standardization processes in place so that instrumentation is automatically added during the build process. That is at the heart of what Stackery does, said Taggart, and he is seeing a greater focus on putting organizational systems in place to support those VPs of engineering who are managing hundreds of developers and need to ensure consistency.
The idea is to move away from the familiar scenario where an engineer builds a function and releases a Lambda in the enterprise production account and, on the day that it fails, as is so often the case, that engineer is on holiday. If the organization has introduced standardization, so that every resource is shipping logs and collecting metrics, then instead of tracking down which engineer built the function and working out what they did, the organization can move directly to resolving the problem by addressing the reason why the failure, underperformance, or cost blowout occurred.
“When building self-healing apps, you design resilience into the app. While developing the app you are thinking ahead to when it fails, how will you auto-recover and deal with failure,” said Taggart.
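One small piece of designing for failure can be sketched as retrying a throttled call with increasing delays and, if it still fails, parking the event for later replay instead of losing it. In the Python example below, `ThrottledError`, `invoke_with_recovery` and the `dead_letters` list are hypothetical stand-ins chosen for illustration; they are not Stackery's implementation or an AWS API.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider-side throttling error."""


def invoke_with_recovery(func, event, dead_letters,
                         max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call `func(event)`, retrying on throttling with exponential backoff.

    If every attempt is throttled, the event is appended to `dead_letters`
    so it can be replayed later, and None is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(event)
        except ThrottledError:
            if attempt == max_attempts:
                break
            # Wait base_delay, then 2x, then 4x, and so on between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    dead_letters.append(event)
    return None
```

The `sleep` parameter is injectable so the backoff schedule can be checked in a test without actually waiting. Only throttling is retried here; other exceptions propagate so that genuine bugs are not hidden by the recovery path.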
Stackery is a sponsor of The New Stack.