How HashiCorp Does Site Reliability Engineering
The release of a new upgrade or product is a triumphant event for any organization. But it holds the potential for massive, embarrassing failure if a company’s systems can’t handle the spike in traffic when customers rush to buy or download it.
Fortunately, HashiCorp has a team of site reliability engineers (SREs), several teams in fact, who foresee high-traffic moments, help optimize system operations to handle them, and ask the right questions ahead of time.
Such as, according to Martin Smith, one of its staff SREs, “How many customers do you really want when you announce this new product at HashiConf?”
Site reliability engineering, said Smith in a presentation at HashiConf here on Wednesday, “really requires a lot of product input to do well.”
Smith and Patti Borne, senior manager of site reliability engineering at HashiCorp, have worked on the company’s Core SRE team for three years and two years, respectively. The core team was assembled from a variety of disciplines, including developers and architects as well as operations engineers, Borne said.
“We just require that they have a passion for reliability,” she said.
Embedding with Teams to Uncover Problems
HashiCorp began what it calls its “production engineering” journey in late 2019. At first, Borne told the HashiConf audience, it focused its SRE efforts on everything except product code.
In June 2020, its SRE work started to also focus on infrastructure, including the company “stack” and HashiCorp Cloud Platform infrastructure.
A year ago, the company’s SREs began seeking ways to improve developer productivity and shorten time to production. In addition to the Core, infrastructure and developer productivity SRE teams, another reliability team works with the product teams.
The SRE teams work in a number of ways. One of the things they do is embed with other teams, such as product teams, for specific projects. They gather data, ask questions and help identify patterns.
Embedding gives the host teams “another lens on their code, because they’re so close to it,” Borne said. “Just a second set of eyes sometimes.”
They also create tools to help developers achieve better reliability and velocity. For instance, she said, “We set up monitors for them to just set up, plug and play.”
Building solutions is much of what HashiCorp’s SREs do, said Smith. They emphasize automation and reducing toil. “We want our SREs to write code to solve problems,” he said. “We want at least 50% of your time as an SRE to be spent writing code.”
Gathering and analyzing data about how systems and applications are working, and identifying patterns that can indicate inefficiencies or problems, is an essential part of HashiCorp’s approach to site reliability engineering, he added.
“Getting that data in front of teams is an important part of what we do,” he said. For example, he advocated “aggregating the number of the type of incidents and putting that in front of teams.” Such insight can point developers and engineers toward the specific culprits behind recurring incidents and help them resolve those incidents faster.
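As a rough illustration of the kind of aggregation Smith describes, the sketch below counts incidents by type for each team. It is not HashiCorp's tooling: the record layout, the team names, the incident types, and the helper names `incidents_by_type` and `top_recurring` are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical incident records. The field names, team names, and
# incident types are invented for illustration only.
incidents = [
    {"team": "team-a", "type": "deploy-failure"},
    {"team": "team-a", "type": "deploy-failure"},
    {"team": "team-a", "type": "disk-full"},
    {"team": "team-b", "type": "cert-expiry"},
    {"team": "team-b", "type": "deploy-failure"},
]


def incidents_by_type(records):
    """Count incidents per (team, type) pair."""
    return Counter((r["team"], r["type"]) for r in records)


def top_recurring(records, team, n=3):
    """Return up to n of the most frequent incident types for one team."""
    counts = Counter(r["type"] for r in records if r["team"] == team)
    return counts.most_common(n)


if __name__ == "__main__":
    # Print a simple per-team summary that could be shared with each team.
    for (team, kind), count in sorted(incidents_by_type(incidents).items()):
        print(f"{team}\t{kind}\t{count}")
```

Calling `top_recurring(incidents, "team-a")` on this sample data ranks "deploy-failure" first, which is the sort of recurring culprit such a summary is meant to surface.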
How the Core SRE Team Works
The Core SRE team, said Borne, spends about 40% of its time doing consultative work, such as:
- Holding SRE office hours.
- Troubleshooting hard problems.
- Conducting a request-for-comments review for everything from architecture to service level objectives (SLOs).
Another 20% of the team’s time is spent on process improvement, including such tasks as:
- Incident management.
- Reliability reviews.
- Heightened awareness of emerging issues.
Another 20% of the time is spent embedding in projects around the organization, working on such things as:
- Launching a new service.
- Improving reliability patterns.
- Making SLO improvements.
The final 20% of the Core SRE team’s hours are spent on reliability tooling tasks, such as:
- Creating reusable monitors and components.
- Incident tooling, such as making a status page.
- Creating a catalog for automation.
Reliability as a Product Feature
As its SRE teams have been built out, Smith said, they have been guided by a set of core values. Defining those values has also helped define the scope of the teams’ work.
HashiCorp’s SREs define reliability as “the system is trustworthy of performing consistently well.” “It’s been helpful to use this definition when talking to teams,” Smith said. Because otherwise, “sometimes it feels like everything is included.”
The company, he noted, has a culture of “doing a lot of research about what the customer wants in the beginning.” And SREs, Borne said, are “the last team in the room advocating for the customer experience.”
The SREs treat reliability as a customer-facing product feature and evangelize that line of thinking throughout the company. They urge internal teams at HashiCorp to design with reliability in mind. “We have this conversation a lot around architectural product decisions,” Smith said.
The SREs request internal comments whenever something is being built. They ask questions like, “Have you thought about what the architecture you’re building means for maintenance windows?”
Smith noted, “Every customer cares about reliability. When you ask, ‘What kind of features do you want in Consul?’ They might not mention this. But I guarantee you, if it’s down, they’ll mention that it’s a big problem.”