PagerDuty’s CTO Alex Solomon on Building Microservices
Deploying microservices in cloud as well as on-premises environments can offer enormous boosts in computing power and agility. But once they are rolled out, the ability to effectively manage hundreds or even thousands of microservices running in highly distributed environments becomes critical to maintaining these ambitious deployments.
Alex Solomon, chief technology officer and co-founder of digital operations management company PagerDuty, should know. Solomon started his career on the ground floor of cloud computing as a software engineer for Amazon.com more than 10 years ago, in 2006, and is now involved in the cutting edge of microservice deployments on the cloud. For this latest episode of The New Stack Makers podcast, he described how cloud computing environments have evolved over the years, along with the current and emerging challenges associated with event triage, communication, collaboration and other concerns. The episode was hosted by Alex Williams, founder and editor-in-chief of The New Stack, during the recently held PagerDuty Summit in San Francisco.
In the early days of massive cloud deployments, when Amazon Web Services (AWS) was just beginning as a business unit, Amazon had to build many of its own tools for monitoring and managing events at scale, while larger firms such as HP, IBM and EMC were also serving the market at the time, Solomon said. When Solomon was at Amazon from 2006 to 2008, the company relied largely on tools built in-house for event triage, for example, which have since evolved into today’s tools and best practices for communicating incidents or alerts.
Today, of course, applications have “grown tremendously” and thus produce an exponentially higher number of events and alerts compared to, say, 10 to 12 years ago. “Instead of just having humans be there on the forefront of triaging all of those events and alerts, you have to have software systems to automate and filter out all the noisy alerts and noisy events and focus on the signal,” Solomon said.
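As a rough illustration of what filtering noisy alerts to focus on the signal could look like, here is a minimal Python sketch. It is not PagerDuty’s implementation. The `Alert` fields, the 1-to-5 severity scale, the severity threshold and the five-minute deduplication window are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: int      # hypothetical scale: 1 = informational, 5 = critical
    timestamp: float   # seconds since some reference point


def filter_signal(alerts, min_severity=3, dedup_window=300.0):
    """Drop low-severity alerts and repeats of the same (service, check)
    that arrive within `dedup_window` seconds of the last one kept."""
    last_kept = {}  # (service, check) -> timestamp of the last alert kept
    signal = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.severity < min_severity:
            continue  # below the threshold: treated as noise
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < dedup_window:
            continue  # repeat of an alert that was already surfaced
        last_kept[key] = alert.timestamp
        signal.append(alert)
    return signal
```

Real systems would likely weigh many more inputs, but the idea is the same: software decides which events reach a human at all.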
It may seem obvious, but ensuring the right teams are in place is critical. “When you have a major incident where it’s actually impacting a business-critical application, you want to make sure that the right teams get on it as quickly as possible and only the right teams,” Solomon said.
However, in the event of a major failure that might have a major impact on business operations, organizations have a tendency to waste resources by putting too many people on notice. “Communicating gets really hard with a hundred people in a conference call if you’ve ever tried that,” Solomon said. “So this is something that we focus on from a private perspective and also from our own best practice perspective: get only the right people on the call in order to streamline the incident response-and-resolution process.”
In a highly distributed microservices environment, managing the dependencies of microservices has also emerged as a key challenge.
“You can have all of these services and microservices and they’re all talking to each other and you have to understand in this distributed environment who’s talking to what because if one service or one microservice goes down, the root cause might not be in the microservice,” Solomon said. “It might be based on one of the dependencies, one of the downstream dependencies.”
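To make the dependency point concrete, the following Python sketch walks a service dependency map from a failing service down to the unhealthy dependencies that have no unhealthy dependencies of their own, which are the likeliest places to start looking. The function name, the dictionary-based map and the set of unhealthy services are hypothetical and chosen only for illustration. Solomon did not describe a specific algorithm.

```python
def probable_root_causes(service, depends_on, unhealthy):
    """Return the unhealthy services reachable from `service` whose own
    dependencies are all healthy.

    depends_on: dict mapping a service name to the services it calls
    unhealthy:  set of service names currently failing their health checks
    """
    causes = []
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        if current in seen or current not in unhealthy:
            continue  # already visited, or healthy and so not a suspect
        seen.add(current)
        failing_deps = [d for d in depends_on.get(current, []) if d in unhealthy]
        if failing_deps:
            stack.extend(failing_deps)  # keep following the failure downstream
        else:
            causes.append(current)  # nothing below it is failing
    return sorted(causes)
```

For example, if a hypothetical `checkout` service calls `payments`, which in turn calls `ledger-db`, and all three are unhealthy, the function returns only `ledger-db`, so the team that owns it would be the first to get the call.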
As microservices mature, “things are getting easier to manage on the node level,” but at the same time microservice environments remain highly complex. “Things can fail and do fail all the time. So you have to architect the systems for not just scale but resiliency and failure tolerance,” Solomon said. “At any moment, your container, server, application or a part of your application may go down and you have to architect around that so that it is failure tolerant.”
In this Edition:
1:55: Back then, what was event triage? What did that look like in the early days when you were starting your career?
6:53: The challenges that come with the surface area of these new tools
12:51: When you think of best practices, what are some of the things you’re starting to see that are proving effective?
16:40: How is PagerDuty helping teams collaborate?
18:55: Internally at PagerDuty, how have your views evolved on incident response and your own process as you scale?
23:53: What is really exciting to you these days?