The Role of Site Reliability Engineering in Microservices
You can always spot the hot jobs in technology: they’re the ones that didn’t exist 10 years ago. While Site Reliability Engineers (SREs) certainly existed a decade ago, they were mostly inside Google and a handful of other Valley innovators. Today, however, the SRE role exists everywhere. From Uber to Goldman Sachs, everyone is now in the business of keeping their sites online and stable.
While SREs are hotshots in the industry, their role in a microservices environment is not the natural, peanut-butter-and-jelly fit it might seem. Although SREs and microservices evolved in parallel inside the world’s software companies, the latter actually make life far more difficult for the former.
That’s because SREs live and die by their full-stack view of the entire system they are maintaining and optimizing. The role combines the skills of a developer with those of an admin, producing an employee capable of debugging applications in production environments when things go completely sideways.
As Google engineers essentially invented the role, the company offers a great deal of insight into how they manage systems that handle up to 100 billion requests a day. They boil down reliability into an essential element, every bit as desirable as velocity and innovation.
“The initial step is taking seriously that reliability and manageability are important. People I talk to are spending a lot of time thinking about features and velocity, but they don’t spend time thinking about reliability as a feature,” said Todd Underwood, an SRE director at Google.
Underwood said reliability and availability should be considered at every level of a project. As an example, he cites the way Gmail fails by dropping back to a bare HTML experience, rather than by halting altogether. “I’ll take the ugly HTML [version], but I can read my email. Availability is a feature and the most important feature. If you’re not available, you don’t have users to evaluate your other characteristics. Organizations need to choose to prioritize reliability.”
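The fallback behavior Underwood describes can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Gmail is built: the function names `render_rich_inbox`, `render_basic_html`, and `render_inbox` are hypothetical, and the "rich" renderer simply simulates an outage.

```python
import html


def render_rich_inbox(messages):
    """Hypothetical full-featured renderer; here it simulates an outage."""
    raise RuntimeError("rich UI backend unavailable")


def render_basic_html(messages):
    """Bare-bones fallback that depends on nothing but the message list."""
    items = "".join(f"<li>{html.escape(subject)}</li>" for subject in messages)
    return f"<ul>{items}</ul>"


def render_inbox(messages, rich=render_rich_inbox, basic=render_basic_html):
    """Prefer the rich view, but degrade instead of failing outright."""
    try:
        return rich(messages)
    except Exception:
        # Availability first: an ugly page beats no page at all.
        return basic(messages)


print(render_inbox(["Welcome", "Your invoice"]))
```

The design choice is that the fallback path has as few dependencies as possible, so it keeps working when the richer path does not.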
Underwood stipulated that every organization is different and that some of the issues Google encounters are not typical. But he did advocate for some more holistic practices.
“For distributed applications, we’re running some kind of Paxos consistent system. We have a whole chapter on distributed consensus. It seems like a computer science, nerdy thing, but really if you want to have processes and know which ones are where, it’s not possible without Paxos in place,” said Underwood. Paxos is a family of algorithms for distributed consensus, often used to resolve the inconsistencies that can arise in distributed systems.
Underwood highlights another essential aspect of the SRE job here, however: visibility. When microservices are throwing billions of packets across constantly changing ecosystems of cloud-based servers, containers, and databases, finding out what went wrong, and where, is essential to troubleshooting any type of problem. This is where the full-stack aspects of an SRE’s job come into play.
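For readers who have not met Paxos before, the core idea fits in a short sketch. What follows is a simplified, in-memory version of single-decree Paxos in Python, written for illustration only: there is no networking, no failure handling, and no retry logic, and the class and function names (`Acceptor`, `propose`) are assumptions of this example rather than anything Google has described.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0           # highest proposal number promised so far
        self.accepted_n = 0         # number of the accepted proposal, 0 if none
        self.accepted_value = None  # value of the accepted proposal, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n > 0:
        value = prior_value

    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "leader=node-a")
second = propose(acceptors, 2, "leader=node-b")
print(first, second)  # both rounds settle on "leader=node-a"
```

The second proposer asks for a different value but ends up confirming the first one, which is the property that lets a set of processes agree on a single answer.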
Google recently introduced a number of tools just for this type of work.
“The whole market over the last few years has been shifting very deliberately towards microservices. We see this with Kubernetes and Istio, and the general move to the cloud from the data center. There are some challenges along the way. If you have 100 containers, things like doing a stack trace on a monolith become very difficult. You need a distributed trace,” said Morgan McLean, product manager for Google Cloud Platform.
To remedy this, Google recently released Stackdriver Trace, Stackdriver Debugger, and Stackdriver Profiler. There’s a reason these tools sound like old-school testing and operations tools from traditional enterprise vendors: they perform the more traditional troubleshooting tasks developers and operations people are used to, but with a focus on microservices and performing these duties in the cloud.
Stackdriver Profiler, currently in beta, allows direct monitoring of CPU utilization for applications running in the cloud. Stackdriver Debugger offers a way to essentially insert breakpoints into cloud-based, microservices-based applications, and Stackdriver Trace offers the distributed tracing capabilities McLean alluded to.
“This is really powerful for general performance improvements and powerful for cost reduction,” said McLean of Stackdriver Profiler. “Snapchat tried it out, and within a day of collecting data they realized a very small piece of code (I think it was a regular expression), which should not have even shown up in Profiler, was actually consuming a fairly large amount of CPU. This could happen to anyone. It happens to Google. The Snapchat demonstration was just a really great demonstration of the power of this profiling technology.”
“Without tools like this, this generally isn’t possible. Tracing was becoming a common industry practice. Profiling and production debugging are a little more unique in our offering,” said McLean.
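The kind of surprise McLean describes, a regular expression quietly eating CPU, is the sort of thing a profiler surfaces. As a rough local analogy only, the sketch below uses Python's built-in `cProfile` module rather than Stackdriver Profiler, which samples running production services. The pattern `EMAIL` and the functions `validate`, `handle_request`, and `profile` are hypothetical names invented for this example.

```python
import cProfile
import io
import pstats
import re

# Hypothetical validation pattern, standing in for the "small piece of code".
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")


def validate(addresses):
    """Keep only the addresses that match the pattern."""
    return [a for a in addresses if EMAIL.match(a)]


def handle_request(addresses):
    """Pretend request handler whose cost is dominated by validation."""
    return len(validate(addresses))


def profile(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)
    return result, out.getvalue()


addresses = ["user@example.com", "not-an-address"] * 5000
count, report = profile(handle_request, addresses)
print(count)
print(report)
```

Reading the report from the top shows which calls account for the most cumulative time, which is how an unexpectedly expensive helper gets noticed.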
The focus on new-style tooling is shared by Matt Chotin, senior director of technical evangelism at AppDynamics. He said that teams need to rethink the way they determine the health of entire applications once they’ve been moved from monolith to microservices.
“You have a myriad of systems. The joy of microservices is that you get to pick the stack that’s right for a particular piece. Each thing might have its own way of monitoring, its own metrics, etc. To understand the health of your entire application and see how a transaction is going to flow through all these different microservices, you have to have a system that is going to help you navigate that. You want something that is going to think in terms of the transaction,” Chotin said.
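Thinking "in terms of the transaction" usually means tagging every hop with a shared trace ID so the whole request can be reassembled later. The Python sketch below shows that idea under loud assumptions: the three "services" are plain functions in one process, `SPANS` is an in-memory list standing in for a trace collector, and none of the names reflect the AppDynamics or Stackdriver APIs.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # stand-in for a trace collector; a real system would ship spans out


@contextmanager
def span(trace_id, service, parent=None):
    """Record how long one service spent on one transaction."""
    span_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent": parent,
            "service": service,
            "ms": (time.perf_counter() - start) * 1000,
        })


def inventory(trace_id, parent):
    with span(trace_id, "inventory", parent):
        time.sleep(0.01)  # simulated work


def payment(trace_id, parent):
    with span(trace_id, "payment", parent):
        time.sleep(0.02)  # simulated work


def checkout():
    """Entry point: creates the trace ID and passes it to every downstream call."""
    trace_id = uuid.uuid4().hex
    with span(trace_id, "checkout") as root:
        inventory(trace_id, root)
        payment(trace_id, root)
    return trace_id


def trace_view(trace_id):
    """Collect every span that belongs to one transaction."""
    return [s for s in SPANS if s["trace_id"] == trace_id]


tid = checkout()
for s in trace_view(tid):
    print(f'{s["service"]:<10} {s["ms"]:.1f} ms')
```

Because every span carries the same trace ID and a parent pointer, the view is organized around the transaction rather than around any single service.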
Engineers shouldn’t think only in terms of whether a service is up or down, Chotin said. “Your DevOps team cares about looking at a service to know general availability, but as far as whether or not you are serving the business correctly, you need monitoring that can traverse the entire ecosystem, from application code to infrastructure code.”
Google’s Underwood said that the overall goal for SREs inside the company is to limit the growth of their own ranks while enabling Google’s growth. As Underwood puts it, “It’s super important for us that SREs grow sublinearly with Google. We’d like to continue to get more efficient.”
To that end, he said, Google SREs focus specifically on their own applications. “We focus on a deep level on the specific services we work on. Teams that work on Google Docs, teams that work on ad serving; each team focused at a very high level of detail on those services. At the same time, we have SRE teams that build common infrastructure used across all the SRE teams.”