The Site Reliability Engineering Tool Stack
Site Reliability Engineering (SRE) can mean different things to different companies, and the operators responsible for reliability typically use a DevOps toolset. One thing is certain, however: SREs combine the skills of software engineers with those of production and operations management to achieve high reliability and ensure that SLO/SLA targets are met. SREs therefore need a firm grip not only on the technologies involved in the system, but also on the intricacies of production deployments. They also have to develop and execute incident response processes.
Fortunately, there are many tools and technologies that can aid their work. This post will discuss the most practical tools for SREs and how they help achieve high reliability, effective communication and transparency.
Top SRE Tools
Let’s explore some of the most critical tools and services that can aid SREs in their day-to-day operations.
In general, you won’t see any real difference in the production tools used by sysadmins and SREs. The point of divergence is in how SREs use those tools: they adopt specific principles and best practices in order to achieve high reliability.
The following are the most useful tools for SREs:
APM or General Monitoring Tools
The first thing that SREs need to do is to configure effective methods of measuring everything and capturing reliability targets. By measuring the right actionable data, along with the right criteria and thresholds, SREs allow the rest of the tools that depend on that information to respond automatically. Which APM and monitoring tools are most useful for SREs has been the subject of much discussion, and each tool has its pros and cons. Whichever tool you choose, though, the reliability of the monitoring system itself is paramount, since it watches all of your information sources and integrations. An incident would be a terrible time to discover that your tool was not gathering and processing any information.
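As a rough illustration of "the right criteria and thresholds", the sketch below computes an availability figure from request counts, compares it with an SLO target, and flags the case where the monitoring data itself has gone stale. The names (WindowStats, evaluate), the 99.9% target and the 120-second staleness limit are assumptions made for this example; they do not come from any particular APM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Counts for one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    seconds_since_last_sample: float


def availability(stats):
    """Fraction of successful requests, or None if there was no traffic."""
    if stats.total_requests == 0:
        return None
    return 1 - stats.failed_requests / stats.total_requests


def evaluate(stats, slo_target=0.999, max_staleness_s=120.0):
    """Classify a window as 'stale', 'no_data', 'ok' or 'breach'.

    Staleness is checked first: if the monitoring pipeline has stopped
    reporting, the counts cannot be trusted either way.
    """
    if stats.seconds_since_last_sample > max_staleness_s:
        return "stale"
    sli = availability(stats)
    if sli is None:
        return "no_data"
    return "ok" if sli >= slo_target else "breach"
```

Checking staleness before anything else reflects the point made above: a silent monitoring pipeline should raise its own alert instead of being read as a healthy system.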
Automated Incident Response Systems
Systems sometimes fail, and an experienced SRE will have taken steps to protect against that. However, bad things that are beyond anyone’s control can still happen, such as cyberattacks, DDoS attacks, and hardware failures. It is therefore essential to have a set of tools and controls in place that can deploy the right people, processes and information if such a disaster occurs. An automated incident response system does exactly that, and it will often integrate with monitoring tools and communication channels as well. To reduce informational silos, it is particularly important for SREs to share the ownership of every incident among all related parties. Ultimately, the goal is to reduce the toll on any one team by allowing all interested parties (like devs and managerial staff) to participate in resolving production incidents. Useful tools for this include Opsgenie and PagerDuty.
Real-Time Communication Rooms
Various communication channels should be established for handling incidents, keeping track of their status, and even pinging other SREs for help. Real-time communication is essential. Ideally, you should set up a quick response system that alerts the right people and accounts for the status of each employee (including time zones, vacation, and sick time). Once that is in place, you can triage incidents and set alerts for changes that might trigger an incident. Tools like Slack, Mattermost and MS Teams offer excellent features and integrations for successful communications.
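To make the idea of accounting for each employee’s status more concrete, here is a minimal sketch that filters a roster down to the people who are inside their local working hours and not on leave. The Responder class and its fields are hypothetical, and time zones are simplified to fixed UTC offsets that ignore daylight saving; a real setup would normally rely on the scheduling features of the paging or chat tool.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time, timedelta, timezone


@dataclass
class Responder:
    """Hypothetical roster entry; the field names are illustrative only."""
    name: str
    utc_offset_hours: float          # fixed offset, ignores daylight saving
    work_start: time = time(9, 0)    # local working hours, start inclusive
    work_end: time = time(18, 0)     # end exclusive
    days_off: set = field(default_factory=set)  # local dates (vacation, sick)


def is_available(responder, now_utc):
    """True if the responder is at work, in local time, at the given instant."""
    local_tz = timezone(timedelta(hours=responder.utc_offset_hours))
    local_now = now_utc.astimezone(local_tz)
    if local_now.date() in responder.days_off:
        return False
    return responder.work_start <= local_now.time() < responder.work_end


def pick_responders(roster, now_utc):
    """Names of everyone who can reasonably be alerted right now."""
    return [r.name for r in roster if is_available(r, now_utc)]
```

Passing a timezone-aware UTC datetime as now_utc keeps the comparison unambiguous. An empty result is itself a useful signal, because it means the alert has to escalate to whoever is on call out of hours.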
Project Tracking Tools
Incidents need to be logged and tracked so that there is a clear trail of documented events. Ideally, this process should be automated, but doing it manually can also be a good choice, especially if your tickets require a certain level of detail and quality. These tickets often act as live documents that detail ongoing issues and alerts, and they can be very useful when passing a task from one employee to the next. Once all of the issues are resolved, the tickets can be archived or logged in a more standardized format in a company wiki. SREs are responsible for staying on top of the contents of this documentation, since it may later be used for postmortems or auditing purposes. Tools like Jira, GitLab and Pivotal Tracker are handy for this.
IDEs and Programming Editors
Part of an SRE’s job is to jump into the code editor and push fixes, and this flexibility helps protect the business from failure. SREs can rectify bad deployments or revert bad commits in line with the error budget. To do this, they need to know their editors and understand the inner workings of the application software, at least at a basic level.
Their development environments should be configured beforehand so that nothing slows them down when, for example, an unstable commit is deployed to production or a variable is left uninitialized, and they can fix issues quickly. Once the issues are resolved, the SREs can monitor the behavior of the system and make sure there aren’t any lingering side effects.
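The error budget mentioned above can be worked out with simple arithmetic. For example, a 99.9% availability target over a 30-day window allows roughly 43.2 minutes of downtime. The sketch below shows that calculation and a simple rule of thumb for when reverting should take priority over fixing forward. The function names and the 25% reserve are assumptions for illustration, not an established policy.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def remaining_budget_minutes(slo_target, window_days, downtime_minutes):
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


def should_prioritize_rollback(slo_target, window_days, downtime_minutes,
                               reserve_fraction=0.25):
    """Hypothetical rule: revert first once less than a reserve remains."""
    budget = error_budget_minutes(slo_target, window_days)
    remaining = budget - downtime_minutes
    return remaining < reserve_fraction * budget
```

With plenty of budget left, a team may accept the risk of fixing forward. Once the remaining budget drops below the reserve, reverting the bad commit is the safer default.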
The Road Ahead
To meet these high expectations, SREs need a broad array of tools and services to ensure the reliability of the system. Without them, handling the various reliability metrics and factors can become unwieldy, to say the least.
This is where StackPulse can help. StackPulse offers a complete, well-integrated solution for managing reliability, including automated alert triggers, playbooks, and documentation helpers. Try out this demo to see what we have to offer.