The Best Site Reliability Engineering Tools in 2021
Site reliability engineering (SRE) is an exciting field to be in right now. That’s not only because SREs bear a unique set of responsibilities, but also because they normally have the freedom to choose their data engineering tools and technologies so they can prioritize reliability in their day-to-day operations.
In a previous article, we discussed the most practical tools for SREs and explained how they help achieve high reliability, effective communication and transparency. In this post, I’ll outline my personal favorite SRE tools.
Key APM (Application Performance Management) and Monitoring Tools
Although there are many good APM and monitoring tools available, the three below are my personal favorites:
- Datadog: Datadog is marketed as a cloud monitoring solution, and it offers almost everything you need in that regard. For example, you can set up monitors, review current infrastructure hosts, collect events, add synthetic monitoring and real user monitoring (RUM), and so on. It offers plenty of opportunity for customization, and it integrates well with other systems. The UI looks rather bloated and the query language takes time to learn, but with proper training you will be able to get the most out of this service.
- Kibana: Kibana is a free data visualization platform that displays logs and metrics, typically those stored in Elasticsearch. If you are using the Elastic Stack (ELK Stack), then Kibana is the most suitable tool for the job. Kibana also supports other use cases, such as SecOps and business analytics, that make it a valuable tool. Since it’s free, Kibana is a good option for small businesses and startups too.
- New Relic: New Relic is a cloud-based platform that specializes in observability, telemetry and monitoring performance. It is used to track the performance characteristics of distributed services and applications within a single dashboard, and it’s primarily targeted at large-scale enterprises.
Key Automated Response Systems
In my opinion, the top three automated response systems for SREs are:
- PagerDuty: PagerDuty is a cloud-based incident response platform that specializes in on-call rotations and incident management. It can integrate with many providers and services, and it works well when real incidents occur. The company’s pricing model is quite affordable, and the product is suitable for all types of businesses. You can even receive calls and notifications on your phone or smartwatch by installing the native app.
- VictorOps (Splunk On-Call): VictorOps was acquired by Splunk and is now called Splunk On-Call. It is another good option for an enterprise-level incident response system, although it’s a bit pricey. If your organization already uses Splunk, it makes sense to adopt its on-call option as well.
- Opsgenie: This incident response platform is part of Atlassian, so it’s a good option for those who already work with Atlassian products. The pricing model is also quite affordable: it even has a free tier, with basic alerting and on-call management for up to five users.
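To make the idea of an on-call rotation concrete, here is a minimal sketch in Python of a round-robin schedule. It is only an illustration of the concept, not how PagerDuty, Splunk On-Call or Opsgenie implement scheduling; the engineer names, start date and one-week shift length are all assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(engineers, rotation_start, shift_length, moment):
    """Return who is on call at `moment` in a simple round-robin rotation."""
    if moment < rotation_start:
        raise ValueError("moment precedes the start of the rotation")
    # Integer number of complete shifts since the rotation began.
    shifts_elapsed = (moment - rotation_start) // shift_length
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical team and schedule, for illustration only.
ROTATION = ["alice", "bob", "carol"]
START = datetime(2021, 1, 4, 9, 0, tzinfo=timezone.utc)
WEEK = timedelta(weeks=1)

if __name__ == "__main__":
    print(on_call_at(ROTATION, START, WEEK, START + timedelta(days=8)))
```

Real incident response platforms layer overrides, escalation policies and time zones on top of this basic idea, which is a large part of why teams buy a product rather than build one.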
Key Real-Time Communication Tools
Using real-time communication tools greatly improves response readiness. My top three recommendations in this category are:
- Slack: Slack, which is now part of Salesforce, is one of the most popular platforms for real-time communication. It is intended to serve as a primary collaboration tool for teams and businesses, but you can also use it as a programmable platform for automating events or responses. Its pricing model is geared toward all kinds of businesses, and it even has a free version (with a limited message history). Slack can also be used to set up ChatOps services and other hooks.
- Microsoft Teams: Teams is Slack’s main competitor. It might be a good solution if your organization is already using Microsoft products from the Office 365 suite. There is even a free version for up to 100 users.
- Telegram: Telegram is a simple and reliable messenger platform for all kinds of teams. Better yet, it’s free and offers an API for programmatic access.
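As a rough sketch of what programmatic access can look like, the Python snippet below builds (but does not send) an HTTP request for a Slack incoming webhook and one for the Telegram Bot API `sendMessage` method. The function names are my own, and the webhook URL, bot token and chat ID are placeholders you would replace with your own values.

```python
import json
import urllib.parse
import urllib.request


def slack_webhook_request(webhook_url, text):
    """Build a POST request for a Slack incoming webhook (JSON body)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def telegram_send_message_request(bot_token, chat_id, text):
    """Build a POST request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text})
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


if __name__ == "__main__":
    # Placeholder webhook URL; nothing is sent until urlopen() is called.
    req = slack_webhook_request(
        "https://hooks.slack.com/services/T000/B000/XXXX",
        "Disk usage above 90% on a web host",
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Keeping request construction separate from sending makes the alerting code easy to test without touching the network.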
Best Project Tracking Tools
You should also keep track of incidents and record events using a project tracker. I would recommend the following:
- Jira: This is Atlassian’s main and most ubiquitous product. It’s an agile platform for tracking projects and team progress, and it’s used by professional organizations of all sizes. The downside is that it can feel very slow at times.
- Trello: This is also from Atlassian, but it’s more approachable and easier to use than Jira. You can get started with Trello for free, and it scales really well without much investment.
- Asana: This agile project management service is free to start and grows with your business. It has a growing user base: according to Crunchbase, it brought in revenue of $142.5 million last year, which isn’t bad considering that the online project management software market is valued in excess of $4 billion. If you are not happy with Jira, Asana is a good alternative to try.
Best Infrastructure Deployment Tools
Finally, SREs will want to automate part, or even all, of the deployment infrastructure. The following are good tools for doing just that:
- Terraform: Terraform is a tool from HashiCorp that exemplifies infrastructure as code (IaC). It allows DevOps teams to describe their infrastructure components, such as VMs, Kubernetes clusters, databases or VPCs, using a domain-specific configuration language. Then it takes these descriptions and creates the corresponding components in the target cloud environments. Terraform is practically a must-have tool for development, and the good news is that you can start using it for free.
- Ansible: Ansible is a tool that automates IT infrastructure. It primarily uses YAML files (playbooks) to describe roles, services and tasks that need to run in a specific order. On each run, Ansible connects to machines over SSH and runs the tasks described in the playbooks as scripts. When the tasks finish, it removes any scripts or temporary information from the connected hosts and reports the status back to the user. Since it’s written in Python, Ansible is very extensible and it can handle a wide variety of roles and scripts.
- SaltStack: SaltStack is another IT infrastructure and configuration management tool with an unusual approach. It relies on agents installed on hosts, which receive and execute commands through data-driven orchestration. If configured correctly, it can automate deployments to thousands of nodes with minimal effort. SaltStack was acquired by VMware, so its future now depends on VMware’s vision.
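The declarative idea behind infrastructure-as-code tools can be sketched in a few lines of Python: you describe the state you want, the tool compares it with what exists, and it works out the actions needed to reconcile the two. This is a toy model under my own assumptions, not Terraform's actual planning algorithm; the `plan` function and the resource names are hypothetical.

```python
def plan(desired, current):
    """Compare desired and current resources; return the actions to reconcile them."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources, for illustration only.
desired = {
    "vpc.main": {"cidr": "10.0.0.0/16"},
    "vm.web": {"size": "small", "count": 2},
}
current = {
    "vm.web": {"size": "small", "count": 1},
    "db.legacy": {"engine": "postgres"},
}

if __name__ == "__main__":
    for action, name in plan(desired, current):
        print(action, name)
```

Running the plan against a state that already matches the description yields no actions, which is the property that makes repeated runs safe.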