Site Reliability Engineering and the Art of Improvisation
Site reliability engineering (SRE) depends on orchestration and improvisation. Developing a great SRE practice requires a deep understanding of the technical infrastructure, but also the confidence to trust your instincts and just start jamming.
I run a weekly continuous learning session at Blameless that takes its title from the traditional Indonesian orchestra: the Gamelan (pronounced “gah-meh-lahn”). This orchestra is mostly percussion: lots of tuned gongs and mallet instruments, a few stringed and wind instruments, and usually a male or female singer, all joining together through rhythm and extemporaneous songwriting.
You see, a key element of gamelan is that the music is written by the group as it is practiced, with the belief that music should grow and change. As they meet time and time again, members continuously evolve new versions of the songs. Practices begin to look like performances and vice versa. It’s a lot like improvising jazz, where the gig is merely another time to get together and play.
Some of the ways we transform and evolve our understanding at these sessions include:
- Walkthroughs of observability toolsets, a.k.a. “Morning Vistas”: What do you observe when you open the laptop to start the day and look across the operational landscape? This provides fresh perspectives on how our colleagues approach their regular work.
- Decision requirements table building, for instance around the most difficult decisions faced during on-call or live maintenance of our Kubernetes clusters. These help us think about how we can make improvements to support responders making decisions under duress.
- Team knowledge elicitation, like deeper views into NGINX Ingress logging or attempting a dependency matrix for our critical path. It’s very useful for squeezing some of that juicy knowledge out of our experts’ brains.
- Asking the question, “Why do we have on-call?” to share mental models of how different people at the company view and engage with it. We learn about each other’s expectations and how we might alleviate the fears of being on call for the first time.
- Spin the Wheel of Expertise! a.k.a. “Who? What? Where?” Here we explore our technology stack and services through gameplay, asking each person to spin the wheel and then show us firsthand how they would come up with the answer, or how they would escalate if they simply didn’t know.
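As a rough illustration of the dependency matrix exercise mentioned above, here is a minimal Python sketch. The service names and their dependencies are hypothetical placeholders, not our actual stack, and the helper functions are assumptions made for illustration. It builds a simple 0/1 matrix of direct dependencies and lists everything a given service ultimately relies on.

```python
# Hypothetical services and their direct dependencies (illustrative only).
DEPENDS_ON = {
    "web-frontend": ["api"],
    "api": ["database", "ingress"],
    "ingress": [],
    "database": [],
}


def dependency_matrix(depends_on):
    """Return (services, matrix) where matrix[i][j] == 1 means
    services[i] depends directly on services[j]."""
    services = sorted(depends_on)
    matrix = [
        [1 if col in depends_on[row] else 0 for col in services]
        for row in services
    ]
    return services, matrix


def transitive_dependencies(depends_on, start):
    """Return start plus every service it depends on, directly or indirectly."""
    seen = set()
    stack = [start]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        stack.extend(depends_on.get(service, []))
    return sorted(seen)
```

In a session, the interesting part is rarely the code itself but the disagreements it surfaces: two people filling in the same row differently is a cue to compare mental models.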
What we’ve created at Blameless is an opportunity for learning: a time to come together collaboratively, share mental models and tell stories about different areas of the system in a safe, low-pressure setting so we can carry that learning forward. This way, incidents become merely another time to apply our powers of intuition, because we’ve already put techniques for addressing them into practice. More precisely, we call this “The Practice of Practice”: the experience we absorb when we actually do our craft, whether that is improvisation, production or incidents.
My motto has sometimes been that it doesn’t much matter what we do together as long as we’re doing it together. Regardless of attendance, the discussions always dive into shared perspectives and allow participants a safe space to explore things without fear of the judgment or anxiety associated with an incident. It is impossible for any single person to know the full complexity of networked software, so it becomes critical to know where to find expertise and how to learn from doing instead of trying to follow prescriptions or hastily reviewed runbooks.
One of my favorite things about running these opportunities for learning is seeing the participants employ aspects of their regular work while we answer questions or explore one UI or another. This allows others to peek into their coworkers’ mental models. What might seem like mundane, ordinary tasks to one may illuminate an understanding for another, or even lead others to embellish their own patterns and work style.
Our themes and agendas are somewhat loose but usually planned so we’re not just staring at each other. Nevertheless, sometimes we are required to adapt. One session was held on the same day as a large vendor outage that disabled a portion of our own UI we needed for that day’s game. So we pivoted, and it became a session with two of our experts on the subject of the vendor outage, which in this case was root certificate authorities and the SSL/TLS protocol.
Although there is an emphasis on the operational parts of our complex system, the participants are far from just infrastructure engineers and SREs. We have sessions including people from technical writing, software development, customer service, strategy, marketing and even management. We make the calendar invite optional and companywide, and we do not call it a meeting: It’s a session, where we can share stories and have fun in a live setting.
In all these activities we seek to open doors that people might be afraid to go through, learning by experiencing how our peers answer questions about a service or technology. We pick up on the patterns and praxis of others, and this enriches our own set of intuitive responses, creating new pathways and new connections in our own mental models. That broader view of the system provides the foundation to be adaptive when responding to incidents.
Build to Adapt
In the grand socio-technical scheme of things, “the Practice of Practice” enables us to build upon the resilience that blossoms like the harmony of well-practiced jazz musicians. The magic and excitement found in discovery is food for our brains. Our synapses hunger for enriching pattern recognition, combining new experiences with old ones and other mental models to form new ones.
The superhero-like power of instantly pulling solutions out of seemingly nowhere has its origins in bringing our practiced scales, melodies, theories, rhythms and other patterns together in inspiring combinations.
Instead of suffering the stressful common-ground breakdowns during incidents that translate to a poor customer experience, we seek new ways to choreograph our socio-technical systems more confidently. We see as an organization that there is power in this kind of collaboration; participants have praised these sessions as some of the best on-the-job learning they’ve ever done.
So it is true that having a firmer handle on how to cope with ambiguity, rather than eschew it, comes directly from knowing how to do our jobs better at the sharp end. But we’re not in it alone. We do this by drawing on our rich network of humans in collaborative joint activity, recognizing how our regular work interrelates and feeds into the very complexity we seek to understand.
It’s not a whole lot different from the way musicians influence and support each other through their playing. Imagine how extremely uncomfortable events can be lightened by an unassuming session on what choices you have when your very reliable servers go down. Incidents are unplanned and can thus be intimidating, but the team has got your back. This is a situation you have all practiced, so it’s just another time you’re getting together to make music.