The Art of DevOps Communication, at Scale and On-Call
This post is the latest in our series on “Cloud Native DevOps,” sponsored by CloudBees. Check back each Monday for additional posts.
Each DevOps transition is as unique as the company that is going through that silo-breaking, restructuring, cultural change. But there are some similarities found in companies of all sizes. DevOps is meant to unite tech teams around individual ownership, responsibility and feeling like each person has a stake in the business. How you communicate and scale these changes dramatically affects the success or failure of a DevOps transformation.
For this article, we spoke to a team at tech giant Microsoft that is about a year into its DevOps transition and to Klaviyo, a five-year-old startup turned scale-up that has been DevOps from Day One. No matter where you are in your DevOps process, you can surely use the advice from these site reliability engineers (SREs) about communication at key points in a DevOps transformation.
The Art of DevOps Communication at Scale
DevOps takes a certain personality profile. Since companies are in constant flux, like Tetris pieces continually trying to fit together in a new way, you need a team of individuals committed to a common vision. For Laura Stone, senior site reliability engineer at Klaviyo, this all comes down to hiring the right people.
Klaviyo, a marketing automation tool that applies big data and segmentation to highly personalized marketing, is five years old and has been DevOps from the start. Of course, for the first two years it was just the founder. But in the last three years, it has had to scale up its DevOps practices as the team has grown to 150 people. That’s why the interview process is more of a culture and code test than a Q&A.
The first step is a take-home assignment or, as Stone calls it, a simulation of what it’s like to work at Klaviyo. Candidates have to write a small CRUD (create, read, update, and delete) application that deals with weather data and sends people a personalized email. The right candidates don’t have to be masters of certain languages, but they must show that they are eager to learn and that they are thinking of the next person who has to use that code, including attention to documentation, algorithms, readability, and cleanliness.
“I love it when people write tests: it shows that it’ll be easier to use your code in the future. I think a lot of people should test their code and document their code and they just don’t do it,” Stone told The New Stack.
And DevOps isn’t just about scaling a company but scaling a code base.
Stone gave the example: “Let’s say you had a 100 people signed up to this service and 40 people are in Boston, do you make one API call for each or create it and use it once and cache it?”
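The trade-off Stone describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fetch_weather` is a hypothetical stand-in for a real weather API call, the subscriber list is made up, and nothing here reflects Klaviyo's actual assignment or code.

```python
from functools import lru_cache

# Counter that makes the number of (simulated) API calls visible.
api_calls = 0

def fetch_weather(city: str) -> str:
    """Hypothetical stand-in for a real weather API request."""
    global api_calls
    api_calls += 1
    return f"sunny in {city}"

@lru_cache(maxsize=None)
def cached_weather(city: str) -> str:
    """Fetch the forecast once per city and reuse it for later subscribers."""
    return fetch_weather(city)

def build_emails(subscribers):
    """Return one personalized message per (name, city) subscriber."""
    return [f"Hi {name}, it is {cached_weather(city)}." for name, city in subscribers]

# Made-up data: 100 subscribers, 40 of them in Boston.
subscribers = [(f"user{i}", "Boston") for i in range(40)] + [
    (f"user{i}", "Chicago") for i in range(40, 100)
]
emails = build_emails(subscribers)
```

With the cache in place, the 100 emails need only one lookup per distinct city rather than one per subscriber. A real service would probably also want cached forecasts to expire, since weather data goes stale.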
“I don’t look at a ton of resumes — they’re very difficult for SRE roles. Show me the code.”
— Laura Stone on importance of hands-on DevOps interviews.
Once you pass your first test, you come in for some collaborative coding.
“The interview process that we have is set up to simulate what it is like to work here,” said Stone.
“You are put in charge of refactoring. They don’t need to look stuff up. They can ask us anything.” She added that the goal is to understand how candidates work and to look for their “openness around when they don’t know something.”
The next part of the interview is about DevOps ownership. The interviewers tell candidates that they now own this code and ask them questions like:
- Where do you want it to run?
- How do you want to be notified if it fails?
- What’s the infrastructure it’s running on going to look like?
- How would you scale to different user needs?
Stone says they aren’t looking for candidates who know all the answers, especially among those fresh out of university, but they want to see signs of a desire to be on call and to own their service from creation through maintenance.
The team at Klaviyo isn’t looking for precise answers but for a candidate’s ability to think ahead. Candidates should suggest running their code on a server rather than on their own computer, so that it isn’t tied to one person and that person’s infrastructure. They should also think about where automation can speed up processes.
“Someone who is ready to be in a DevOps culture would have this culture,” Stone said.
Most importantly, the interviewers need to see signs of customer empathy and a desire to make sure the solutions are as stable as possible.
Stone says they are looking for “people who are motivated to learn and who are technically savvy and who can show empathy. Given the right structures and resources in place, then they can be successful owning their service from start to finish.”
She continued, “As the company scales, we’re being very thoughtful about how we codify processes within the engineering team because even people who are motivated and highly skilled and empathetic, if they don’t have the right structure and resources in place, things still won’t work.”
When Stone joined 18 months ago, there was just one product team and an SRE team. Now there are an additional four or five product engineering teams focused on specific areas of the solution. They have a greater need to concentrate on “knowledge transfer as engineers can no longer know the entire product and, in many cases, can no longer have deep relationships with other engineers on the team.”
Scaling DevOps all comes down to one question: How can communication flow and how can people still have ownership?
One form of scalable knowledge transfer they use is mob programming. It’s like agile’s pair programming, but the whole team is working on the same thing at the same time on the same computer. They did this when adopting Terraform to automate their infrastructure.
“No one was familiar with it within our organization so I had to come in and teach Terraform. We did a mob programming session where the SREs acted as mentors to the product teams,” Stone said.
She thinks this knowledge transfer is working because it used to be exceptional if she didn’t get paged on-call. Now it’s exceptional if she does.
The Art of DevOps Communication During Incidents
One of the important aspects of DevOps is breaking down barriers and providing cross-functional training so everyone feels an equal responsibility for code that’s being released. For most DevOps teams, that means streamlining incident response and instituting all-hands-on-deck on-call rotations. When you are trying to create a world of always-on continuous delivery and integration, you need people willing to protect uptime at any time of day or night.
“DevOps is a buzzword, so given that we moved to DevOps, it was really up to us to define what that means for our team, which for us meant everyone is responsible for infrastructure and you’re on-call,” Senior Site Reliability Engineer Nida Farrukh told The New Stack, describing her experience of the DevOps journey that the Microsoft Social Engagement and Market Insights (MSE) team began a year ago.
This small, nimble team is working to minimize downtime for thousands of customers. Following a recent restructuring, a handful of SREs now share infrastructure and on-call responsibilities with developers.
“We moved to DevOps [and] it was really up to us to define what that means for our team, which for us meant everyone is responsible for infrastructure and you’re on-call.” — Nida Farrukh
“This generally increases the health of your monitoring system because the people who are writing the code are fixing it and feeling the pain points too,” explained Farrukh.
Each team member is on call for one week at a time. To start, each trainee has a shadow week, acting as backup for an SRE or another fully trained developer. Then, within a few weeks, the trainee takes on the role of primary administrator on duty (AOD), and the more experienced person shadows. There are also ample tutorials, docs and regular simulated outage exercises to assist.
“People generally are very nervous when they go on call for a service for the first time and they’ve never been on call for anything,” Farrukh said.
She continued that it isn’t about knowing a product inside out, it’s about knowing where to find what you need:
- Do you have the tools to debug a new problem that comes up?
- Do you know where to find the answer in documentation?
- Do you know who can answer your questions and how to contact them?
On the MSE team, while there is some leeway to choose appropriate tools for tasks, the infrastructure team works hard to keep standardization across their stacks, with a limited number of logging systems, languages, libraries and monitoring systems, so everyone shares a baseline knowledge.
Ask for Help
A lot of times new folks can feel afraid to ask for help, but Farrukh’s team works hard to emphasize that everyone can and should ask for help because, in DevOps, the entire product is the entire team’s responsibility.
“In the MSE team, as AOD you are first responder, not sole responder. You can pull in anyone in the entire team to help you during an incident,” she said.
The engineering leads and SREs have even gone out of their way to volunteer to respond any time.
“An AOD has the power to call anyone but there is some sort of psychological barrier so these people have come out and said ‘Please call me’,” Farrukh said.
She said software architects are usually a good first contact for issues within a DevOps organization because, if they don’t know exactly what the problem is, they’ll know who to call.
Farrukh believes that AODs can and should delegate tasks that distract from their main goal: debugging efficiently. Even looking for the right contacts and then calling them can be a distraction.
She suggests two roles to help limit these interruptions: bridge manager and comms person.
The bridge manager can be in charge of bringing on everyone that should be on a conference call about the issue. The bigger the issue, the more people may be called in from the escalation matrix.
Other engineers, execs and even marketing and customer service often have loads of questions that can distract the AOD from her work. The suggested comms role, which Farrukh says she often volunteers for, fields those questions and serves as the main channel through which the AOD disseminates information.
For smaller issues, the comms person and bridge manager may be the same person, and for really small issues the AOD may fill both roles herself. In practice, on the MSE team, the AOD often has five to six engineers helping her with larger problems, including SREs for faster deployments and rollbacks.
“We tend to know how the infrastructure works and how to put in place workarounds. We either advise the AOD or take responsibility for specific tasks,” Farrukh explained.
To learn more about DevOps communication in incident response, Farrukh recommends studying examples of active, analog response systems, such as the Incident Command System, developed for wildfire response in California, and the practices of the U.S. National Transportation Safety Board.
The opinions and views expressed in this article are those of the interviewees and do not necessarily state or reflect those of Microsoft.