
How Amazon Prime Video Engineering Builds Team Resilience

At Chaos Carnival, speakers from the streaming giant revealed how they are using machine learning to help avoid incidents when traffic spikes.
Feb 8th, 2022 3:00am

Highly distributed software systems allow organizations to scale faster, and give their engineers more speed, control and autonomy. But as companies scale, so does complexity.

No company knows that better than Amazon Prime Video, where scale means cross-functional teams in multiple geographies serving millions of users across thousands of APIs. With the added complexity of live events and popular streaming premieres, its teams have had to learn to grapple with sharp spikes in traffic and workloads.

As a result, continuous resilience is at the core of Amazon’s cross-organizational success. For Prime Video, that comes down to addressing scale, complexity and impact on customers.

Two memorable sessions at January’s Chaos Carnival, the annual user conference hosted by ChaosNative, came from Prime Video’s resilience and chaos engineering team, which focuses on supporting the company’s DevOps teams as they continuously improve how they predict, prepare, operate and learn.

In order to keep up with often unpredictable traffic loads at a global scale, Prime Video is experimenting with a mix of human creativity, machine learning and team resilience scores.

[Diagram: supporting DevOps as a continuous loop of predict > prepare > operate > learn, alongside three related aims: proactive reliability, actionable insights and tool building.]

Machine Learning for Continuous Resilience

Olga Hall, director of technical programs at Prime Video, kicked off the panel on achieving continuous resilience in DevOps with machine learning — by reflecting on Kiwi cricket.

Her team was preparing to release Amazon Prime in New Zealand, so it had to be ready to live stream the popular matches, which can be played over three to five days, lasting at least six hours per day. The peak in traffic usually comes when everyone tunes into the final hour of the match — but the timing of that viewership spike often isn’t predictable in advance.

Modeling its approach on lessons from the DevOps bestsellers “Accelerate” and “Architecting for Scale,” Hall’s team ran some science experiments ahead of its southernmost launch to date. It created machine-learning models to figure out workload shape and demand, applied chaos engineering to simulate failure scenarios, and practiced incident recovery.

One of Amazon founder Jeff Bezos’s mantras has famously been: “Good intentions never work, you need good mechanisms to make anything happen.”

Hall’s team looked for ways to build continuous resilience mechanisms around five principles:

  1. Workload modeling. What kind of event, with what audience size, at what time? Which devices? Which regions? While customers are watching live sports, what will happen with the on-demand content?
  2. Play game days around those models.
  3. Run failure injections in parallel. In addition to automatic load testing in production across all the services during off-hours for customers, the team performs stress testing and performance testing in non-production environments. It also runs injections for latency, always checking for consistent timeout settings (a minimal sketch of such a check follows this list).
  4. Contingencies and alternate pathways. Making sure fallbacks and failovers automatically kick in at varying levels of architecture.
  5. Observability across everything.
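
To make the third principle concrete, here is a minimal sketch of a latency injection check, under stated assumptions: the timeout value, the stand-in service call and the injected latencies are all illustrative, not Prime Video’s actual tooling. The idea is simply to delay a dependency on either side of an assumed client timeout and confirm the timeout fires consistently.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

CLIENT_TIMEOUT_S = 2.0  # hypothetical client timeout, assumed consistent across callers

def call_downstream(injected_latency_s: float) -> str:
    """Stand-in for a real dependency call; sleeps to simulate injected latency."""
    time.sleep(injected_latency_s)
    return "ok"

def call_with_timeout(injected_latency_s: float) -> str:
    """Enforce the client-side timeout the way an RPC client would."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(call_downstream, injected_latency_s).result(timeout=CLIENT_TIMEOUT_S)

# Inject latency below and above the timeout; behavior should be consistent:
for latency in (0.5, 3.0):
    try:
        print(f"latency={latency}s -> {call_with_timeout(latency)}")
    except FuturesTimeout:
        print(f"latency={latency}s -> timed out (expected above {CLIENT_TIMEOUT_S}s)")
```

In a real failure injection the delay would be introduced at the network or service-mesh layer rather than in process, but the assertion is the same: timeouts should trip at the same threshold wherever the call is made.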

Hall’s team uncovered a pattern and decided to split the experiments into two buckets — controllable and uncontrollable.

  • Controllable inputs (workload modeling and game days) are where the team applies machine learning to run continuous or timed experiments.
  • Uncontrollable inputs (failure injections and contingency planning) are where humans make the decisions, so engineers get to have fun experimenting.

Observability is needed across both to really gain from this mashup of automatic anomaly detection and human-led scientific experimentation.

The next step will be applying machine learning and artificial intelligence to things that the team so far deems uncontrollable.

Machine Learning for Workload Forecasting

Workload forecasting is rather like weather forecasting. You’re predicting how workloads will vary in the future under increasingly complex and unpredictable circumstances.

But while the climate crisis is making historical patterns less reliable as the basis for weather forecasting, Prime Video teams can still build on what Ali Jalali, an Amazon applied scientist, dubbed “normal circumstances” — and then perform experiments like suddenly increasing a customer base.

In the same Chaos Carnival panel where Hall spoke, Jalali said there are also a lot of variables for Amazon Prime’s teams to consider, including:

  • Customer metrics
  • Feature rollout
  • Long-term planning
  • Cloud-based tools
  • New marketing strategies
  • Seasonalities, such as day-of-week, monthly and quarterly cycles

Jalali’s team needs to take all those variables and determine an optimal future risk level — which, for Prime Video, he said, is somewhere between 90% and 95%. With that in mind, his team uses “classical time series models to essentially narrow down the area for the forecast, and then use more advanced technologies, like deep learning, to really zoom into that area and find the exact numbers.”
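
As a rough illustration of that two-stage idea, here is a minimal sketch on synthetic data, assuming weekly seasonality: a classical SARIMA model produces the coarse forecast and a confidence band (loosely analogous to the 90% to 95% risk level), and a small neural network then refines the point estimate by modeling the residuals. The data, lag features and model sizes are all assumptions for illustration; Jalali’s team would use far richer inputs and deeper models.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic daily "concurrent streams" series with weekly seasonality.
rng = np.random.default_rng(0)
days = pd.date_range("2021-01-01", periods=200, freq="D")
y = pd.Series(
    1000 + 200 * np.sin(2 * np.pi * np.arange(200) / 7) + rng.normal(0, 30, 200),
    index=days,
)
train = y[:-14]

# Stage 1: a classical time-series model narrows down the forecast area.
sarima = SARIMAX(train, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit(disp=False)
coarse = sarima.get_forecast(steps=14)
baseline = coarse.predicted_mean
band = coarse.conf_int(alpha=0.05)  # loosely analogous to a 90-95% risk band

# Stage 2: a learned model "zooms in" by predicting the residual structure.
lags = 7
X = np.column_stack([train.shift(i).to_numpy() for i in range(1, lags + 1)])[lags:]
resid = (train - sarima.fittedvalues).to_numpy()[lags:]
mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
mlp.fit(X, resid)

# Refine the first forecasted day using the most recent observed lags.
recent = train.to_numpy()[-lags:][::-1].reshape(1, -1)
refined = baseline.iloc[0] + mlp.predict(recent)[0]
print(f"coarse: {baseline.iloc[0]:.0f}  refined: {refined:.0f}  "
      f"band: [{band.iloc[0, 0]:.0f}, {band.iloc[0, 1]:.0f}]")
```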

He said this combination has worked for his team, but that it still has a way to go: first forecasting a baseline workload, then scaling up from there. It gets harder when a live sports event streams at the same time as a big on-demand premiere.

“Resiliency is the intersection of complexity, scale and impact.”

— Olga Hall, director of technical programs, Amazon Prime Video

When Amazon Prime released the second season of the wildly popular Indian action series “Mirzapur,” the company could easily predict when viewership would peak. Its teams then used machine-learning models to predict how that spike would combine with other live events and partner content releases.

“We need a predictive model that can tell us ahead of time what’s going to happen at that exact moment in time, so that we can prepare for it,” Jalali said.

With this in mind, Prime Video has built a library of past Amazon events to create a similarity engine, which engineers combine with data pouring in via social media and IMDb ratings to predict hype. Then they test capacity and resiliency against that hype.
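
A similarity engine of this kind can be sketched with nothing more than feature vectors and cosine similarity. Everything below (the event names, the features and their values) is hypothetical; the point is only to show how an upcoming title can be matched against a library of past events.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-normalized feature vectors for past events:
# [social buzz, IMDb rating / 10, prior peak (scaled), is_live_sport]
past_events = {
    "drama premiere A":  np.array([0.70, 0.84, 0.31, 0.0]),
    "cricket final B":   np.array([0.90, 0.00, 0.65, 1.0]),
    "action premiere C": np.array([0.95, 0.89, 0.58, 0.0]),
}

def most_similar(upcoming: np.ndarray, library: dict) -> list:
    """Rank past events by cosine similarity to the upcoming one."""
    names = list(library)
    matrix = np.stack([library[n] for n in names])
    scores = cosine_similarity(upcoming.reshape(1, -1), matrix)[0]
    return sorted(zip(names, scores), key=lambda kv: -kv[1])

upcoming = np.array([0.92, 0.87, 0.55, 0.0])  # a hyped on-demand premiere
for name, score in most_similar(upcoming, past_events):
    print(f"{name}: similarity {score:.3f}")
```

The peak traffic curves of the closest matches would then serve as targets for the capacity and resiliency tests the teams run against the predicted hype.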

The teams can even automate regional considerations. For example, Jalali said that many viewers in India live stream events from their phones, but because mobile data is expensive, they gather in free public Wi-Fi spots. So his team has trained models to test under those conditions.

With all of this in place, the Prime Video teams can then automate the delegation of loads to different data centers based on availability and latency, making reactive decisions in real time, including over CPU and memory optimizations, which allows for auto-tuning and autoscaling.
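
A toy version of that delegation decision might look like the following, with the data-center list, field names and thresholds all invented for illustration; the real system would act on live telemetry and far more signals.

```python
# Hypothetical data-center state; in reality this would come from live telemetry.
data_centers = [
    {"name": "us-east",  "available": True,  "latency_ms": 40, "load": 0.82},
    {"name": "eu-west",  "available": True,  "latency_ms": 95, "load": 0.45},
    {"name": "ap-south", "available": False, "latency_ms": 30, "load": 0.10},
]

def pick_data_center(centers: list, max_load: float = 0.9) -> str:
    """Route to the lowest-latency center that is up and has headroom."""
    candidates = [c for c in centers if c["available"] and c["load"] < max_load]
    if not candidates:
        raise RuntimeError("no healthy data center available")
    return min(candidates, key=lambda c: c["latency_ms"])["name"]

print(pick_data_center(data_centers))  # -> "us-east"
```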

This includes a new carbon footprint model that Jalali said factors in how the power is generated and the type of machine used.

Machine Learning to Reduce Incident Management

Geoffrey Robinson, principal technical program manager at Prime Video, also on the Chaos Carnival panel, compared incident management automation with adaptive cruise control, which has evolved from speed control all the way toward autonomy.

Robinson’s team is dedicated to answering questions like, “What are the things that engineers have to do multiple times and where can we automate that? How can we improve our process so that they can use their brainpower to solve more strategic needs?”

His team focuses on ways to reduce cognitive load at one of the most stressful times — when the pager calls.

One of his team’s objectives is to reduce time to mitigation. It uses data to uncover patterns for things like false alarms, allowing for easier error flagging and troubleshooting. This tool also highlights the likely culprits: any deployments made in the last 15 minutes.
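
The deployment-correlation part of that idea is easy to sketch. The function below, with hypothetical field names and the 15-minute window from the talk, simply filters a deployment log for anything that finished shortly before the alarm fired.

```python
from datetime import datetime, timedelta

def likely_culprits(alarm_time: datetime, deployments: list,
                    window_minutes: int = 15) -> list:
    """Return deployments that landed within the window before the alarm."""
    cutoff = alarm_time - timedelta(minutes=window_minutes)
    return [d for d in deployments if cutoff <= d["finished_at"] <= alarm_time]

# Illustrative deployment log (service names and fields are hypothetical):
deploys = [
    {"service": "playback-api", "finished_at": datetime(2022, 1, 20, 2, 48)},
    {"service": "catalog",      "finished_at": datetime(2022, 1, 20, 1, 5)},
]
alarm = datetime(2022, 1, 20, 2, 55)
for d in likely_culprits(alarm, deploys):
    print(f"suspect: {d['service']} deployed at {d['finished_at']:%H:%M}")
```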

Through machine learning, his team got the incident onboarding process down to five minutes, starting with a ticket-declaration service. After resolution, the team feeds the tagged incident data back into the model, which in turn feeds into game-day simulations.
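
The shape of that feedback loop might be sketched like this, with the record structure and tags invented for illustration: resolved, tagged incidents accumulate in a library, and game-day scenarios are drawn from it.

```python
import random

incident_library = []  # grows as real incidents are resolved and tagged

def record_incident(tags: dict, minutes_to_mitigate: int) -> None:
    """After resolution, feed the tagged incident back into the library."""
    incident_library.append({"tags": tags, "minutes_to_mitigate": minutes_to_mitigate})

def next_game_day_scenario() -> dict:
    """Game days replay scenarios drawn from real, tagged incidents."""
    return random.choice(incident_library)

record_incident({"cause": "bad-deploy", "service": "playback-api"}, 23)
record_incident({"cause": "dependency-latency", "service": "auth"}, 41)
print(next_game_day_scenario())
```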

“Anything we can automate, like adaptive cruise control, we can feed back into that incident,” Robinson said. “So before the next incident occurs, we know that data will either be available to the team that’s troubleshooting or they’ve gone through the game days and they’ve tested to make sure that things are ready for it.”

This data-informed automation will likely decrease incidents over time, Hall said. “We see a future for all of us — for us as a team, for many of you in the audience — where an engineer sees a problem or issue only once,” she said. “This is a repeatable, controllable, understandable problem.”

But for now there’s still a human involved for those rare events or anomalies that fall outside machine-recognizable patterns. Just like we haven’t yet eliminated the need for humans to drive cars, Amazon Prime’s process isn’t automating the humans out of incident management — yet. It’s just trying to keep them from waking up unnecessarily at two in the morning.

Team Resilience Score

Prime Video colleague Sudeepa Prakash kicked off her Chaos Carnival talk by asking the live audience how they react when an executive establishes a new scoring system. About three-quarters of the audience members admitted that their companies were in the process of rolling out new top-down measurements — although that doesn’t mean they were happy about it.

Prakash, a senior product manager at Prime Video, told the crowd how her company developed its team resilience score as a mechanism to encourage the preparedness that drives operational readiness. She also gave tips on how to introduce new metrics-based concepts — without scaring everyone off.

The focus must be on teams aligning around proactive reliability, which Prakash described as working to “influence operational excellence through preparedness to avoid failures.” To address the intersection of complexity, scale and impact, they chose to anchor around the goal of availability because, as she said, “When you have a higher resilience, you’ll perhaps have a higher availability.”

The term resilience at Prime Video translates to:

  • Preparing to avoid failures.
  • Operating successfully in the presence of failures.
  • Accepting that failures are inevitable, so you need to have contingencies.

The company looked to turn this concept into a mechanism. But Prakash emphasized that building a tool was just a means to a specific end — alignment around availability.

This of course maps back to Prime Video’s continuous improvement cycle of predict, prepare, operate and learn. Managers asked the resilience and chaos engineering team to look at existing engineering practices.

Amazon is really good at recording root cause analysis, Prakash said, so the company already had the data for what she called “the more meaty components of the score,” which reflected prior issues.

The score components also all came from existing tools, including:

  • Deployment safety measures. Which tools can you use to make sure a deployment isn’t going to cause issues?
  • Operational readiness review.
  • Unit testing and integration testing.
  • Root cause analysis. Specifically, the organization wanted to run checks for any open action items from a root cause analysis.

[Diagram: a team resilience score of 83.40 shown as a diamond with four points: operational readiness review, code coverage, CoE action items and deployment safety. Deployment safety, the most heavily weighted component, shows the lightest shading and needs the most work.]

The team resilience score brings all of this data together into one score per team that everyone can see in one place, built from four components with the following weighting:

  • Deployment safety: 40%
  • Operational readiness review: 30%
  • Center of Excellence action items: 15%
  • Code coverage: 15%

The goal of this diamond-shaped scoring grid is to provide actionable insights, based on regular reporting and shared openly across teams, without increasing cognitive load. At a glance, teams can see where they stand, with any missing components highlighted in light gray. This is where they first look to reduce repetitive actions and automate wherever possible. A worked example of the weighting appears below.
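
As a back-of-the-envelope check on how the weighting works, the sketch below uses the weights from the talk and hypothetical component values chosen to reproduce the 83.40 shown in the diagram.

```python
# Weights from the talk; the per-component values further down are hypothetical.
WEIGHTS = {
    "deployment_safety": 0.40,
    "operational_readiness_review": 0.30,
    "coe_action_items": 0.15,
    "code_coverage": 0.15,
}

def resilience_score(components: dict) -> float:
    """Weighted average of per-component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * value for name, value in components.items())

team = {
    "deployment_safety": 75.0,  # weakest area, and the most heavily weighted
    "operational_readiness_review": 90.0,
    "coe_action_items": 90.0,
    "code_coverage": 86.0,
}
print(f"team resilience score: {resilience_score(team):.2f}")  # -> 83.40
```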

“Visualization is important,” Prakash said. “By just looking at this, teams are able to make quick decisions of what they are going to prioritize.”

What feeds into each team resilience score is different and decided on at the team level. And the scores widely vary based on team complexity.

If a particular area is already optimized for operational excellence, the team moves on to a different area of improvement. Over time, Prakash said, the organization has learned that it needs to be flexible and constantly iterate on scores, getting continuous feedback from the teams. However, while scoring may change, it shouldn’t change frequently, and the goalposts shouldn’t be moved without a clear reason.

A Resilience Score Can Help Set Priorities

Transparency is key to a successful implementation of a team resilience score, with team members able to dig into the reasons for any lost points. While the score is automated, each team has an override button to add notes or further data if it believes it has already satisfied the criteria.

Prakash emphasized that the team resilience score is a mechanism to help teams, paired with a tool to provide actionable insights, but it is “not a report card, not a means to mandate processes, not only for leaders — it is meant to be a tool for the teams.”

She warned to “treat these scores as proxies, not another evaluation or performance mechanism. It’s meant for the teams to prioritize what they should and should not be working on.”

And while it has the Prime Video teams aligning around resiliency, in true chicken-or-egg fashion, the company isn’t even sure if a high team resilience score correlates with high availability — or the other way around. It is, however, clear that team resilience is grounded in continuous improvement, so scores should continue to rise.
