SRE Housekeeping: Where to Get Started for New Hires
I recently joined a new early-stage scale-up, Firefly, where I had to build its site reliability engineering (SRE) strategy from the ground up. In this post, I’d like to look at the first areas that required attention, aka “housekeeping,” to lay the foundation and groundwork for our future operations. I hope this post will be beneficial, both to first SREs at companies and to those joining larger teams, and will provide a better understanding of the SRE role in production operations.
Once we made progress and achieved better processes and culture in these core areas, we could then focus on even greater challenges of scale and growth. I’ll share how an SRE starts by balancing some critical pieces to ensure the success of the entire engineering organization and business through performance, stability and error budget management, alongside critical cultural aspects of reducing complexity and friction through unnecessary constraints.
Getting Started as an SRE in a Fast-Paced Startup
While there are plenty of posts out there that focus on the differences between DevOps, platform engineering and SRE, this post will focus more on SRE as a discipline — and recommend guidelines for how and where aspiring SREs can get started. Future posts will take a deeper dive into each of these areas and unpack how all of this can be achieved.
In order to make progress in the primary areas under the SRE domain, it is critical to first collect data and understand the current constraints on velocity, so you can see where that friction can be removed. This process involves speaking with the developers to understand more closely what affects their bandwidth in the following areas:
- Features: How quickly they can develop and release new features
- Cost and FinOps: What are the biggest expenses
- Numbers: Current metrics and benchmarks around performance (if there are any)
- Business goals: How the engineering team aligns with the company’s goals
For example, a constant demand from executive leadership (whether the CEO or VP of engineering) is to optimize for cost. However, for an SRE, it is important to do so without affecting other critical aspects of engineering, such as velocity or stability.
Ultimately, engineering isn’t just the cold, hard technical work of shipping code from one side to the other. There is heart in it, with people at its core, and therefore good SRE isn’t achieved just by enforcing new law and order from the top; it requires creating a good and healthy engineering environment. Good SRE means helping the developers understand why the changes are being made, and how the changes will improve everyone’s life as well as the business.
When you work together with the R&D organization to make much-needed changes for maximum resource utilization, you open developers’ minds to the many aspects they influence directly, and how they can do their job better. When done right, this often leads to happier developers who have gained maturity in their engineering capabilities in the process.
SRE and Performance Engineering
One primary area that most SREs are asked to focus on, often as an important first-order improvement, is performance. To begin improving performance, you need to start by measuring: establish a baseline, and from it create the KPIs and metrics you will use to improve on that initial performance. Similar to platform engineers, SREs usually come with a strong engineering background, as one hat an SRE wears is almost that of a consultant and educator for R&D and its engineers.
An SRE will explore the code and metrics and enable engineers to look at their code retrospectively, pointing them to where in the code they need to intervene to improve the performance of the system. The SRE will introduce “production principles”: how code runs in production, and how the way code is written affects the systems. Adding a level of operational understanding of how your piece of code affects the entire system provides a level of growth and maturity in an engineer’s career.
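As a concrete starting point, measuring a baseline can be as simple as summarizing latency samples into the percentiles that later become KPIs. The following is a minimal sketch, assuming latencies collected in milliseconds; the metric names and the idea of gating on p95 regression are illustrative assumptions, not a prescribed toolchain.

```python
# Minimal baseline summary from raw latency samples (assumed milliseconds).
import statistics

def baseline(latencies_ms):
    """Summarize a latency sample into percentiles commonly used as KPIs."""
    ordered = sorted(latencies_ms)
    # statistics.quantiles with n=100 yields the 1st..99th percentiles.
    pct = statistics.quantiles(ordered, n=100)
    return {
        "p50_ms": pct[49],   # median
        "p95_ms": pct[94],
        "p99_ms": pct[98],
        "mean_ms": statistics.fmean(ordered),
    }
```

From a baseline like this, a team might define a KPI such as “p95 latency must not regress by more than 10% between releases,” and then track it release over release.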
Achieving Greater Stability in the Startup Jungle
Anyone who has served as an SRE in an early-stage startup or scaleup knows that until the SRE shows up, it’s generally the laws of the jungle — where the leading and guiding principle is “ship it.” This means that most SREs show up to a highly volatile environment. When it comes to achieving greater stability — this is done on two levels: the code and engineering, and then the infrastructure. We’ll start with the engineering aspects.
This is an important phase in a company’s growth and maturity, as oftentimes the SRE comes to a startup that already has a product, or a significant portion of it, and site reliability engineering cannot be baked in from the first line of code. You come into an engineering organization that is already at full throttle, and you need to align your tasks while everything is in flux and constant motion.
Greater Stability through Better Code Practices
Many engineering organizations today have both junior and senior engineers on the team, and all are constantly bombarded with the conflicting and diverse needs of satisfying existing and potential customers, leveraging open source projects, fixing bugs and rolling out features. As a young developer, you don’t always know how to prioritize under this constant bombardment.
Another important thing to be aware of is that this prevalent context switching (a byproduct of the competing priorities) adds complexity, making code more error-prone; depending on the developer’s maturity, this can have a direct impact on the code’s stability. Stability starts at the code level, and once you have a stable product, you can set out to add greater stability at the infrastructure level.
By creating a greater sense of stability for developers first, you can then do the work to optimize stability on the infrastructure side as part of the SRE practice. You will not be able to achieve stable infrastructure without a stable product and code, so one must start there first.
Stability starts with procedures and processes that ensure engineering abides by the well-known metrics of the SRE discipline: SLIs, SLOs and SLAs (we’ll dive into these under Error Budgets). This is where scheduling, as debated and controversial as it is, adds much-needed stability to the deployment of features and releases.
The intention behind scheduling is to add order to the chaos, not friction. Let’s look at a common example: deployment scheduling is enforced so that a version or feature cannot be pushed after 6 p.m. or on Thursdays (when the weekend starts in Israel). In other locales, this might be Friday after 1 p.m., but the reasoning is the same.
When you work with processes and schedules and failures happen (because failures always happen), they can be contained more rapidly. If there is an incident during working hours (or even during a planned, scheduled deployment in off-hours), you have engineers available to troubleshoot and mitigate it. If you deploy after hours, out of schedule and out of plan, there is a higher likelihood that you won’t have the right people in place to help troubleshoot in real time, creating greater system instability (and eating into error budgets, which will be covered later).
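The deployment-window rule above can be encoded as a simple guard in the delivery pipeline. This is a minimal sketch, assuming an Israeli workweek (Sunday through Thursday) with a 6 p.m. cutoff and no deploys into the weekend; the specific cutoff hour and blocked days are illustrative values a team would tune.

```python
# Deployment-window guard: block deploys after hours or before the weekend.
from datetime import datetime

CUTOFF_HOUR = 18               # no routine deploys after 6 p.m.
BLOCKED_WEEKDAYS = {3, 4, 5}   # Thursday, Friday, Saturday (Monday == 0)

def deploy_allowed(now: datetime) -> bool:
    """Return True if a routine deployment may proceed at this moment."""
    if now.weekday() in BLOCKED_WEEKDAYS:
        return False            # weekend (or the day leading into it)
    return now.hour < CUTOFF_HOUR
```

A CI pipeline would call this check before promoting a release, with an explicit override path for emergency fixes so the schedule adds order, not friction.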
In modern engineering organizations, velocity is the leading benchmark and success metric, as popularized by Accelerate and DORA metrics. However, this can’t be done without safety — speed with safety is critical to running a SaaS business, which is the core of the Accelerate research.
In order to achieve rapid and safe delivery, we need to add a layer of testing and automation, as humans are inherently biased and error-prone. This is the same evolution our configuration has undergone, where Infrastructure as Code and configuration management have all but replaced manual configuration (stop ClickOps!) in order to avoid the human misconfigurations and errors manual work is prone to.
Infrastructure Stability and Optimizations
Once we have created the metrics and KPIs we want to optimize for, introduced structure through processes and scheduling, reduced complexity and improved performance, we can then start to optimize our infrastructure through testing and automation.
By creating staging environments as close to production as possible and allowing our code to “live” there for a duration of time or a workload cycle, we can then view the metrics, test and validate the code automatically, without human bias or error, and make data-driven decisions on whether to ship this code to production. If the code does not meet the guidelines and metrics defined, it doesn’t get shipped. All of this relies on automation, not humans, to add the much-needed layer of infrastructure stability.
Error Budgets
A big part of an SRE’s responsibility is to supply data and metrics and to abide by the service-level agreements (SLAs) made with customers in contracts. Noncompliance, essentially errors, is a breach of this contract and agreement, and when you exceed this threshold (the “budget”) for errors, you ultimately need to compensate the client.
What Does This Mean?
If we have a yearly contract with Client X, and have contracted to have 99.999% uptime, this can be translated to the exact number of hours and minutes of downtime “budget” we need to avoid exceeding for this client.
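The arithmetic is straightforward: five nines (99.999%) over a 365-day year leaves roughly 5.26 minutes of allowed downtime. A minimal sketch of that translation:

```python
# Translate an SLA availability percentage into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

def downtime_budget_minutes(sla_percent: float) -> float:
    """Minutes of downtime permitted per year at the given SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)

# downtime_budget_minutes(99.999)  ->  about 5.26 minutes per year
# downtime_budget_minutes(99.9)    ->  about 525.6 minutes (~8.76 hours)
```

Note how steeply the budget shrinks per extra nine: dropping from three nines to five nines cuts the yearly allowance from almost nine hours to about five minutes.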
As an SRE, and eventually the overseer of all of the data, you have the bird’s-eye view of every customer’s currently available error budget. For example, if Client X has had one hour of downtime and Client Y has had 30 minutes of downtime this year, you constantly need to know the numbers and recalculate so you don’t breach your contractual SLA obligations.
Based on this known available budget per client, we make decisions on whether to deploy potentially problematic code and risk downtime. This is when the SRE needs to calculate whether it’s worth it, and may even exclude a certain client from a deployment to ensure contracts aren’t breached. This is where preparation and proactiveness are critical, through relevant monitoring and metrics, so you don’t find out too late that you’ve breached a contract.
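That bird’s-eye view can be sketched as a per-client ledger: for each client, compare the remaining budget against the downtime a risky deployment might cost, and exclude any client whose budget can’t absorb it. Client names, SLAs and numbers here are illustrative assumptions echoing the example above.

```python
# Per-client error budget ledger and deployment exclusion check.
MINUTES_PER_YEAR = 365 * 24 * 60

def remaining_budget_minutes(sla_percent: float, downtime_so_far_min: float) -> float:
    """Yearly downtime budget at this SLA, minus downtime already consumed."""
    budget = MINUTES_PER_YEAR * (1 - sla_percent / 100)
    return budget - downtime_so_far_min

def clients_to_exclude(clients: dict, risk_minutes: float) -> list:
    """Exclude any client whose remaining budget can't absorb the risk.

    clients maps name -> (sla_percent, downtime_so_far_min).
    """
    return [
        name
        for name, (sla, used) in clients.items()
        if remaining_budget_minutes(sla, used) < risk_minutes
    ]
```

For instance, at a 99.9% SLA, a client that has already consumed 60 minutes of downtime has about 465.6 minutes of budget left, while one with 30 minutes consumed still has about 495.6; a deployment estimated to risk 470 minutes would exclude only the first.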
SRE Metrics for Success
These metrics (SLIs, SLOs and SLAs) are usually defined as the North Star of SRE.
The recommended best practice to ensure you continuously maintain your SLAs is to keep your SLO stricter than the SLA, so that there are multiple thresholds you would have to cross before you ever breach the SLA.
This is also where we need to embed robust monitoring and alerting, integrated into our CI pipelines, to alert us whenever we get close to the danger zone.
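Tiered alerting follows directly from keeping the SLO stricter than the SLA: the alert fires when measured availability drops below the internal target, well before the contractual floor is hit. The 99.95%/99.9% pairing below is an illustrative assumption.

```python
# Tiered alerting: SLO (internal target) is deliberately stricter than
# the SLA (contractual floor), so warnings fire before any breach.
SLA = 99.9    # contractual floor; crossing it may mean compensating clients
SLO = 99.95   # internal target, stricter than the SLA

def alert_level(measured_availability: float) -> str:
    """Classify current availability against the SLO/SLA tiers."""
    if measured_availability < SLA:
        return "breach"    # contractual failure
    if measured_availability < SLO:
        return "warning"   # danger zone: SLA still intact, act now
    return "ok"
```

Wired into the monitoring stack, the "warning" tier is what buys the team time to react while the SLA is still intact.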
Reducing Complexity — the Key to SRE Success
Reducing complexity in all of these core areas will eventually be what leads to improved performance and stability, while maintaining a healthy amount of error budget.
If a system becomes too complex, there are too many variables to take into consideration, making the core metrics of SRE too difficult to calculate and slowing down decision-making (due to too many unknown unknowns).
Reducing complexity also has direct business impact. Our engineering organization needs to constantly strive for smart design and development so that we don’t accrue technical debt and our code has a longer shelf life. By freeing up developers from focusing on bugs, technical debt and other common concerns that are a byproduct of complexity, they can then focus on the reason we all go to work every day — business-driven development — our product, features and innovation.