Handling Cyber Monday with a Mature Engineering Team
Holiday season 2020 was unlike any other. Shopping habits were perhaps irreversibly changed as many shoppers pivoted towards digital. According to Finances Online, 91% of shoppers shifted their buying habits due to COVID-19 with 58% buying primarily online. And what was the biggest day of the holiday season? It wasn’t Black Friday.
Cyber Monday has taken the world by storm and extended the highest peak shopping period by a day. Last year, according to Adobe Analytics, Cyber Monday outperformed Black Friday for the highest overall sales at $10.8 billion, growth of 15% over the year before.
Retailers looking to capitalize on the holiday retail shopping frenzy need to sustain peak levels of traffic from Thanksgiving evening all the way through Cyber Monday. Uptime and availability are a must to serve and convert as many sales as possible, but this also puts an incredible amount of pressure on the technical teams responsible for monitoring and ensuring this reliability.
Teams who are looking to weather the storm should focus on improving their digital operations maturity to plan and build for resilience. A key part of this is putting processes in place to best mobilize a swift, coordinated response to mission-critical urgent work. Investing in the cultural transformation required to sustain best practices for system resilience and response will be well worth the effort if the entire retail machine is humming without issue when the organization’s largest revenue period is on the line.
Operational maturity is a measure of the overall consistency, reliability and resilience of IT infrastructure, including how it is managed and maintained. In the context of digital operations, operational maturity is how prepared an organization is to detect, triage, mobilize for, respond to and resolve outages or system failures.
Improvements to operational maturity can unlock business benefits, including higher revenue, lower employee burnout and better retention, greater system reliability, a protected brand image and more. PagerDuty has a high-level model of operational maturity as well as several sub-models that explore individual aspects. Below is how we see operational maturity through the lens of incident response, a key component of digital operations and crucial during intense peaks of traffic.
Moving from one level to another isn’t something most organizations can do overnight; this is a long and iterative learning process. Over time, your team’s culture will change, not just the processes and technology. This is a long-term investment that’s more complex than just investing in the right tooling. However, taking the time to work on your operational maturity can make the next Cyber Monday easier and each one after more successful.
If your team is looking for some quick tips to help make this upcoming Cyber Monday a success, you can check out our hypercare checklist. If you’re interested in hearing how you can set yourself up for success beyond 2021, read on! Below we’ll break down what your teams can do to increase their maturity over time.
From Manual to Reactive
The manual stage is characterized by chaotic, ad hoc response. There are no defined processes for incident response, or teams must rely on centralized incident management with primitive tools such as spreadsheets and call trees. This is the least mature stage. From here, teams want to move into the reactive stage, which involves starting to swarm incidents. This means a large group of people is mobilized to resolve each incident. While this isn’t targeted, it at least ensures that the issue is receiving attention.
To move from manual to reactive, you’ll need a few things to establish a baseline of competency including defining KPIs and goals, priority services, and key people and their roles. After this, you’ll need to:
Determine communication and notification methods: Adopt modern monitoring and alerting. Monitoring tools will discover anomalies in the way the system is functioning (ideally before your customers do!) and send that information to your alerting and notification platform. From there, the platform will send the alert to the associated stakeholder, eliminating some of the manual toil that comes from outdated call trees and constant “eyes on glass.”
Define a standard incident response process: A uniform process can speed up the response time and help on-call team members feel more confident. With your team, determine:
- What classifies an incident as a particular severity.
- What roles are involved in an incident per severity.
- When to escalate and to whom.
- What checklists can help teams as they work through the response process.
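As a rough illustration of how those decisions might be written down, here is a minimal Python sketch. The severity names, error-rate thresholds, role names and escalation timings are all placeholder assumptions chosen for the example, not recommended values or anything PagerDuty prescribes; your team would substitute its own.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    min_error_rate: float        # share of failed requests that triggers this level
    roles: Tuple[str, ...]       # roles mobilized at this severity
    escalate_after_min: int      # minutes unacknowledged before escalating
    escalate_to: str             # who receives the escalation


# Ordered from most to least severe. Every value here is hypothetical.
POLICIES = (
    SeverityPolicy("SEV1", 0.20, ("incident_commander", "scribe", "service_owner"), 5, "engineering_director"),
    SeverityPolicy("SEV2", 0.05, ("incident_commander", "service_owner"), 15, "engineering_manager"),
    SeverityPolicy("SEV3", 0.01, ("service_owner",), 30, "team_lead"),
)


def classify(error_rate: float) -> Optional[SeverityPolicy]:
    """Return the most severe policy whose threshold is met, or None."""
    for policy in POLICIES:
        if error_rate >= policy.min_error_rate:
            return policy
    return None


def should_escalate(policy: SeverityPolicy, minutes_unacknowledged: int) -> bool:
    """True once the incident has gone unacknowledged for too long."""
    return minutes_unacknowledged >= policy.escalate_after_min
```

Writing the policy as data rather than tribal knowledge means an on-call responder can look up who to involve and when to escalate without having to decide under pressure.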
With these changes, you won’t have to rely on customers to tell you when something is wrong. Your reputation will improve as well, since you can get in front of an incident by proactively communicating the issue to customers. During peaks where degradation and outages might mean customers abandoning their carts, a short customer email or social media post explaining the issue could prevent lost revenue.
From Reactive to Responsive
Reactive mode has several benefits for teams who previously scored as manual. It’s certainly not the ideal state, though, as it still requires many people to take time away from planned work to fight fires, and it’s not targeted. You want to move away from mobilizing dozens of people for a Sev2 issue if you really only need a core team of five or six people to assist. It’s time to look at how your teams can move toward responsive, which is the stage where most PagerDuty customers find themselves.
Responsive is characterized by intelligently swarming issues. Teams will have defined processes and tools for directing the issue to the proper person. While assembling a response team might still be a challenge at this point, at least the initial alert is going to the correct person who then is responsible for looping in the rest of the relevant stakeholders. To get here, teams should map services to teams and create on-call schedules that route to the right person.
Mapping services to teams might take the form of full-service ownership, where people take responsibility for supporting the software they deliver at every stage of the software/service life cycle. At PagerDuty, we think about a service as a discrete piece of functionality that provides value and that is wholly owned by a team. To define services:
- Look at which team owns the services. In the spirit of full-service ownership, this should be the team that builds this service. If multiple teams work on the service, determine if it really needs to be split into smaller distinct services or if one team can take ownership of it.
- Determine what the on-call rotation looks like. Once you have your team determined, use your on-call platform to determine who answers pages. Create a rotation and pilot it before finalizing. Your team can help figure out how to best balance the schedule. With full-service ownership in place and an established on-call rotation, it’ll be easier to ensure that incidents are routing to the right person based on service and who is carrying the pager.
- Examine whether the team supporting the service is correctly sized. If you have a small service with a multitude of team members on call for supporting it, see if other teams need help. Or if you find that you don’t have a large enough team to support the service, bubble up the resourcing needs.
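To make the routing idea concrete, the sketch below shows one way a service-to-team map and a weekly rotation could resolve to the person carrying the pager. It is a simplified illustration, not how any particular on-call platform works: the service names, team names, responder names, the weekly handoff and the rotation start date are all assumptions made up for the example.

```python
from datetime import date

# Hypothetical ownership map: each service is wholly owned by one team.
SERVICE_OWNERS = {
    "checkout-api": "payments-team",
    "search": "discovery-team",
}

# Hypothetical weekly rotations, in handoff order.
ROTATIONS = {
    "payments-team": ["ana", "bo", "chen"],
    "discovery-team": ["dee", "eli"],
}

ROTATION_START = date(2021, 11, 1)  # assumed first day of the first shift (a Monday)


def on_call_for(service: str, day: date) -> str:
    """Return who carries the pager for a service on a given day."""
    team = SERVICE_OWNERS[service]
    rotation = ROTATIONS[team]
    weeks_elapsed = (day - ROTATION_START).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Who would be paged for a checkout alert on Cyber Monday 2021?
print(on_call_for("checkout-api", date(2021, 11, 29)))  # prints "bo"
```

Because every service has exactly one owning team, an alert on a service resolves to exactly one responder, which is what distinguishes this stage from swarming.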
Once you’ve defined your services, create a lightweight service graph that can depict the overall architecture of your system and who owns what. You can flesh this out over time, and as you get more comfortable with this view of services, you can include mapping to business services and articulating dependencies.
Responsive has several advantages over reactive. One of the most important ones is cost efficiency. By intelligently swarming the problem, you no longer need dozens or hundreds of tangentially involved people on big conference calls to resolve incidents. Instead, a targeted team can handle it and loop in more people as necessary. This also helps mitigate alert fatigue, as alerts should be more targeted. This is especially important during peak traffic, as technical teams can’t afford to waste cognitive capacity on alerts that aren’t pertinent to them.
From Responsive to Proactive
The next stage is proactive, where incident management is seamless and coordinated. Rather than just resolving issues as they occur, you resolve them before customers notice there’s a problem.
To reach this level of maturity, you’ll need to optimize and standardize your priority service responses and map business services and dependencies.
Standardizing priority service responses: This is about mitigating failures with the highest customer impact by creating response plays for common failures. In this stage, you’ll want to include runbook automation to help run scripts and provide additional context for responders. In advanced practices, you can even configure runbooks to auto-remediate issues without any human intervention.
PagerDuty customer Parsons is using Rundeck runbook automation to uplevel incident response practices. The Parsons team suggests beginning with small chunks and low-hanging fruit. From there, you can add on sequences to build out more robust automation. As you do it, Parsons recommends that you share your automation between relevant teams. The more you talk about automation and show how it improves the lives of responders, the more teams will adopt it. To learn more about Parsons’ automation initiatives, see this video.
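The response plays and runbook automation described above might be sketched along the following lines. This is a deliberately small, tool-agnostic illustration and does not use or represent the Rundeck or PagerDuty APIs. The alert types, the step functions and the `auto_remediate` flag are hypothetical, and the steps are stubs that return a description instead of touching real systems.

```python
from typing import Callable, Dict, List, NamedTuple


class ResponsePlay(NamedTuple):
    steps: List[Callable[[], str]]   # scripts to run, in order
    auto_remediate: bool             # True if no human needs to be paged


def collect_diagnostics() -> str:
    # Placeholder: a real step would gather logs and metrics for responders.
    return "collected recent logs and metrics"


def restart_workers() -> str:
    # Placeholder: a real step would call your own automation tooling.
    return "restarted worker pool"


# Hypothetical plays for common failures; start small and add sequences later.
PLAYS: Dict[str, ResponsePlay] = {
    "queue_backlog": ResponsePlay([collect_diagnostics, restart_workers], auto_remediate=True),
    "payment_errors": ResponsePlay([collect_diagnostics], auto_remediate=False),
}


def run_play(alert_type: str) -> dict:
    """Run the play for an alert and report whether a human is still needed."""
    play = PLAYS.get(alert_type)
    if play is None:
        # No standardized response yet: page a human with no extra context.
        return {"context": [], "page_human": True}
    context = [step() for step in play.steps]
    return {"context": context, "page_human": not play.auto_remediate}
```

Even when a play cannot remediate on its own, running the diagnostic steps first means the responder who is paged starts with context instead of a bare alert.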
Mapping technical services to business services: In the previous maturity stage, you created a rudimentary mapping of your technical services, but you’ll also want to map these technical services to business services and include dependencies. This helps you gain full visibility into the issues you experience. You’ll understand both the downstream technical effects as well as the business impact of each incident and prioritize accordingly.
To do this, you’ll need to work cross-functionally with other line-of-business stakeholders to ensure that the technical pieces align with the business units. You’ll also need to work with teams that own tangential or dependent services to determine how services affect one another. You can iterate on your previously built service map, making it stronger with each addition.
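One way to picture such a map is as a small dependency graph that can be walked from a failing technical service to the business services it ultimately affects. The sketch below assumes invented service names and a hand-written dependency table; in practice this data would come from your own service catalog.

```python
from collections import deque
from typing import Dict, List

# Hypothetical map: technical service -> services that depend on it.
DEPENDENTS: Dict[str, List[str]] = {
    "postgres-orders": ["order-service"],
    "order-service": ["checkout-api"],
    "checkout-api": [],
    "search-index": ["search-api"],
    "search-api": [],
}

# Hypothetical map: technical service -> business service it powers.
BUSINESS_SERVICES: Dict[str, str] = {
    "checkout-api": "Online Checkout",
    "search-api": "Product Search",
}


def impacted_business_services(failed_service: str) -> List[str]:
    """Walk downstream from a failed service and list affected business services."""
    seen = {failed_service}
    queue = deque([failed_service])
    while queue:
        service = queue.popleft()
        for dependent in DEPENDENTS.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted({BUSINESS_SERVICES[s] for s in seen if s in BUSINESS_SERVICES})


print(impacted_business_services("postgres-orders"))  # prints ['Online Checkout']
```

With a lookup like this, a database alert can be prioritized by the business capability at risk rather than by the name of the component that raised it.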
This helps drive down resolution times and can help prevent customers from realizing an issue happened in the first place. In peak periods where downtime can cost hundreds of thousands of dollars per minute, this is key.
From Proactive to Preventative
Preventative is the last stage in the digital operations maturity model. This stage is characterized by getting ahead of issues before they even start and providing an excellent customer experience. While proactive might mean seamless incident response, preventative goes a step further and asks, “Does this even have to be an incident at all?”
By learning from previous incidents, teams are able to create and optimize feedback loops for continuous improvement and better team health and performance. One key way to achieve this is by conducting thorough and regular postmortems on incidents and prioritizing improvements gleaned from those.
When conducting a postmortem, it’s important to remember to keep it blameless. Rather than blaming people or labeling incidents as “human error,” dig into what happened during the incident and find moments where teams can make their jobs easier or the system more reliable.
By prioritizing these improvements, teams can minimize the amount of unplanned work, have fewer off-hours interruptions (a cause of burnout and attrition), and provide a fantastic customer experience.
Here are some tips for an effective postmortem:
- Make sure the timeline is an accurate representation of events.
- Define any technical lingo/acronyms you use that newcomers might not understand.
- Separate what happened from how to fix it.
- Write follow-up tasks that are actionable, specific and bounded in scope.
- Discuss how the incident fits into your understanding of the health and resiliency of the services affected.
While reaching this level of mastery ahead of this year’s Cyber Monday might be too tall a task for most teams, it’s a great goal, and one you can work on step by step. The benefits for your teams, customers and bottom line are well worth it.
Operational maturity is an evolving and continuous process. As you move to each stage, you’ll learn more about the systems and services you support, what technology choices best support your goals, and what you want your cultural evolution to look like.
This maturity model encompasses how to improve your incident response, but there’s more to overall maturity than this process. In addition to incident response, you’ll need to improve your event intelligence management, automation capabilities and customer service processes. As you make investments in each of these areas, the improvements will ripple across maturity areas. Operational maturity is an excellent success indicator for how well your organization will manage peak traffic times like Cyber Monday.