SRE Tips to Prepare for Black Friday
Gremlin sponsored this post.
Preparing for Black Friday, or any peak traffic event, is an ongoing project for engineering teams who are responsible for building, deploying, and operating production workloads.
Since many Site Reliability Engineers (SREs) and engineering teams are staying home this year due to COVID-19, we won’t be making preparations alongside teammates in our offices. We’ll need to accomplish the same work from our workstations at home, where we’ll also convene our war rooms and manage any incidents that may arise.
Here’s a list of the ways that SREs from companies like Dropbox, Amazon, and Netflix have prepared for peak traffic this holiday season.
Review Past Incidents
Reviewing past incidents is a powerful way to understand how your system has failed previously, and it offers a lot of insight into how the system actually behaves in production. Armed with this insight, you’ll be more confident in the case of an outage. It will also give you a checklist of questions to ask your teams:
- Have we validated fixes for past incidents in light of any new code changes? To prevent the drift into failure, it’s important to revisit fixes for past bugs to ensure the reliability of code and configuration updates.
- Are we prepared with the right amount of infrastructure and correct autoscaling rules to handle a surge in traffic?
- Have we tested the reliability of our application’s critical paths? Validating that the core functionality of our application will perform under stress will make a massive difference to our company’s bottom line.
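The autoscaling question above comes down to simple arithmetic, which can be sketched in a few lines of Python. This is a rough illustration, not a sizing method from any particular vendor: the `ScalingPlan` fields, the traffic numbers, and the 25% headroom are all hypothetical placeholders that you would replace with figures from your own load tests and dashboards.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPlan:
    # All values are hypothetical placeholders; substitute your own measurements.
    baseline_peak_rps: float    # highest requests/second seen on a normal day
    expected_multiplier: float  # projected surge relative to that baseline
    rps_per_instance: float     # load-tested throughput of a single instance
    headroom: float             # spare capacity fraction, e.g. 0.25 for 25%
    autoscaling_max: int        # configured upper bound of the autoscaling group


def instances_needed(plan: ScalingPlan) -> int:
    """Instances required to serve the projected peak plus headroom."""
    projected_rps = plan.baseline_peak_rps * plan.expected_multiplier
    return math.ceil(projected_rps * (1 + plan.headroom) / plan.rps_per_instance)


def check_plan(plan: ScalingPlan) -> str:
    """Compare the requirement against the configured autoscaling ceiling."""
    needed = instances_needed(plan)
    if needed > plan.autoscaling_max:
        return f"RAISE LIMIT: need {needed} instances, max is {plan.autoscaling_max}"
    return f"OK: need {needed} instances, max is {plan.autoscaling_max}"


if __name__ == "__main__":
    plan = ScalingPlan(
        baseline_peak_rps=2000,
        expected_multiplier=4,
        rps_per_instance=250,
        headroom=0.25,
        autoscaling_max=32,
    )
    print(check_plan(plan))
```

With these made-up numbers, the projected peak needs 40 instances while the group is capped at 32, so the ceiling would be hit well before the surge is served. Running the same check per service is a quick way to spot limits that need raising ahead of time.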
Get to Know Your ‘Problem Services’
A pragmatic way to identify “problem services” is to ask your team, “Which services do folks avoid writing code for?” Once you have a list of these services, you can start looking into how to make sure those services don’t cause any headaches on the big day.
Do a little bit of digging to see how those services tend to fail and how the rest of the system responds. Once you understand the failure patterns of a given service, the reliability mechanisms become more obvious. Does the service need a bit more redundancy? Does it have issues with autoscaling properly? Is the connection to an upstream service a little fragile?
Run a Remote FireDrill to Test Your Observability and Runbooks
A FireDrill is a planned event that validates people and processes. Specifically, it is designed to run a team through the proper actions to take when a specific problem arises. Like business continuity plans, FireDrills should be a regular and expected facet of our incident management preparation.
Now that we’re working from home, it’s important to do a dress rehearsal so that we find gaps in our process before we end up troubleshooting an incident from the living room in the middle of Thanksgiving. Are our alerts set up properly, or are we getting paged for non-issues and missing alerts for real problems? Will our dashboards give us the right data, so that we can resolve an incident quickly? And are our runbooks up to date, complete, and accurate?
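One lightweight way to answer the alerting question after a FireDrill is to compare the alerts we expected to fire against the ones that actually paged someone. The sketch below assumes you can export both lists as plain names. The alert names and the `review_drill_alerts` helper are hypothetical and not tied to any monitoring product.

```python
def review_drill_alerts(expected: set[str], fired: set[str]) -> dict[str, list[str]]:
    """Split drill results into missed alerts, noisy alerts, and correct ones."""
    return {
        "missed": sorted(expected - fired),  # real problems nobody was paged for
        "noise": sorted(fired - expected),   # pages unrelated to the drill scenario
        "caught": sorted(expected & fired),  # alerts that behaved as intended
    }


if __name__ == "__main__":
    # Hypothetical alert names for a simulated checkout database failover.
    expected = {"checkout-error-rate", "db-replica-lag", "checkout-latency-p99"}
    fired = {"checkout-error-rate", "disk-usage-warning", "checkout-latency-p99"}

    for category, alerts in review_drill_alerts(expected, fired).items():
        print(f"{category}: {', '.join(alerts) or 'none'}")
```

Anything under "missed" points to an alert that needs to be created or retuned, and anything under "noise" is a candidate for silencing or a higher threshold before the event.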
Create a One-Pager for Your Whole Company About the Event
One of the more time-consuming elements of incident management is making sure that everyone is on the same page. Publishing a company wiki page about the traffic spike and sharing it across your organization will save valuable minutes in the event of an outage.
Here’s a starter list of topics you can include:
- Why you expect the traffic spike and how long you estimate it to last.
- Contact information for all on-call people and a link to the rotation calendar (this should be easily accessible in the first place).
- Known system trouble spots, like potential bottlenecks or single points of failure. This allows everyone in the organization to keep an eye out for potential problems.
- Primary database query plans and any expected query pattern changes, including how long these queries take to run under normal conditions.
- Scaling bounds and known capacity limits, such as a capacity limit on Lambdas.
- Results from Chaos Engineering experiments run on services.
Reproduce Past Incidents with Chaos Engineering
Sometimes we think we have a fix for our past incidents, but we never actually go and test that the fix works. This can be for a number of reasons: inadequate tooling, hesitance to test in production, or perhaps even laziness. But this is a core use case for Chaos Engineering. Because Chaos Engineering enables engineers to precisely and repeatedly recreate turbulent production conditions, we can often reproduce what led to a major incident and verify that a fix does work.
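To make the idea concrete, here is a minimal in-process sketch of such an experiment. It assumes a past incident in which a slow dependency stalled a page, and a fix that times out the call and falls back to an empty result. The function names, the 0.5-second injected delay, and the 0.2-second timeout are all hypothetical. A real Chaos Engineering experiment would typically inject the latency at the network or host level rather than with a `sleep` inside the process.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def fetch_recommendations(delay_s: float = 0.0) -> list[str]:
    """Stand-in for a dependency call; delay_s simulates injected latency."""
    time.sleep(delay_s)
    return ["item-1", "item-2"]


def recommendations_with_fallback(delay_s: float, timeout_s: float = 0.2) -> list[str]:
    """The fix under test: give up on slow calls and serve an empty list."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_recommendations, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return []
    finally:
        # Do not block on the slow call; let it finish in the background.
        pool.shutdown(wait=False)


def run_experiment() -> dict[str, bool]:
    """Check the steady state, then inject latency and verify the fallback."""
    healthy = recommendations_with_fallback(delay_s=0.0)

    start = time.monotonic()
    degraded = recommendations_with_fallback(delay_s=0.5)
    elapsed = time.monotonic() - start

    return {
        "healthy_path_returns_data": healthy == ["item-1", "item-2"],
        "degraded_path_falls_back": degraded == [],
        "degraded_path_is_fast": elapsed < 0.4,
    }


if __name__ == "__main__":
    for check, passed in run_experiment().items():
        print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

The point of the structure is that the hypothesis ("slow recommendations no longer stall the page") is written down as explicit checks, so the same experiment can be rerun after every code or configuration change that touches the fix.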
Uneventful Black Fridays
There’s an apt quote for 2020 that goes, “may you live in interesting times.” But when it comes to our on-call rotations and system behavior, we’d prefer things be boring and predictable. We hope that the above list can help your team prepare for a Black Friday full of happy customers and plenty of downtime with your loved ones.