
Netflix’s Testing Strategies for Migrating to GraphQL

Netflix achieved a zero downtime migration for their mobile apps. Here's how they did it!
Aug 2nd, 2023 8:53am
Feature image by Mohamed Nuzrath from Pixabay.

The popular content streaming service migrated its mobile apps to GraphQL last year, completing this major transition from its internal API framework, Falcor, with zero downtime. Netflix’s engineering team pulled off this huge undertaking safely, for hundreds of millions of customers and without disruption, thanks to some good old-fashioned testing.

A recent blog post detailed the testing strategies Netflix employed during its mobile application migration. The choice of strategy hinged on whether the engineers were validating idempotent behavior or functional requirements. Netflix used Replay and Sticky Canary testing to keep its application working throughout the challenging engineering work.

Migration in a Nutshell

Before migrating to GraphQL, Netflix’s API layer consisted of a monolithic server built with Falcor.

The engineering team chose to build a shim rather than take this complete migration on at once. The GraphQL shim provided client engineers the ability to move to GraphQL quickly, work out client-side concerns such as cache normalization, experiment with different GraphQL clients, and investigate client performance without being blocked by server-side migrations (diagram below).

Netflix engineers wanted to move away from Falcor as soon as they could, so they leaned into Federated GraphQL, which powers a single GraphQL API with multiple GraphQL servers. They could swap out the implementation of a field from the GraphQL Shim to the new Video API Service with federation directives (diagram below).

Testing Strategies

Netflix engineers used two key factors to determine their testing strategies:

  • Functional vs. non-functional requirements: Are the control group and the experimental group experiencing feature parity? Here, Netflix relied on AB Testing (outside the scope of this article but included in their blog post) and a more intensive member of the AB Testing family, Sticky Canaries.
  • Idempotency: The same query outputs the same results over and over again. Replay Testing (yes, that same replay) tested idempotent fields.

Replay Testing — Validation at Scale

Phase one was the introduction of the shim, and phase two brought with it a reimagined (think less tech debt, less buggy) version of Netflix’s Falcor API, now dubbed the Video API Service. The Video API Service had to behave identically to the Shim service. Netflix relied on Replay Testing to verify that the idempotent APIs were migrated correctly from the GraphQL Shim to the Video API Service.

The Replay Testing framework leverages the @override directive available in GraphQL Federation. By default, GraphQL queries resolve against the older schema, in this case the Shim Service. The image below illustrates a pink @override just beside the certificationRating field (Rated R, PG-13). This @override directs the GraphQL Gateway to send the certificationRating resolution to the Video API Service.
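In Federation schema terms, the switch looks roughly like this. This is a hypothetical sketch: the type, key and subgraph names are illustrative, not Netflix’s actual schema.

```graphql
# Subgraph: Video API Service (the new implementation).
# By default the field still resolves in the older subgraph (the GraphQL Shim);
# @override tells the gateway to route this field's resolution here instead.
type Video @key(fields: "videoId") {
  videoId: ID!
  certificationRating: String @override(from: "shim")
}
```

Because the directive works per field, a team can migrate one field at a time and fall back by simply removing the @override.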

The Replay Tester tool samples raw traffic streams from Mantis. The tool can capture a live request from production and run an identical GraphQL query against both the GraphQL Shim and the new Video API Service. The tool compares the results and outputs any differences in response payloads.
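In spirit, the compare step looks something like this minimal TypeScript sketch. The `Executor` abstraction and all names here are assumptions; Netflix’s actual tool samples live traffic from Mantis rather than taking hand-built requests.

```typescript
// Sketch: run one captured GraphQL request against both backends and compare.
type GraphQLRequest = { query: string; variables?: Record<string, unknown> };
type Executor = (req: GraphQLRequest) => Promise<unknown>;

async function replay(
  req: GraphQLRequest,
  control: Executor,    // e.g. POSTs the query to the GraphQL Shim
  experiment: Executor, // e.g. POSTs the query to the Video API Service
) {
  const [a, b] = await Promise.all([control(req), experiment(req)]);
  // Structural equality on the response payloads; real tooling reports
  // field-level diffs rather than a single boolean.
  return JSON.stringify(a) === JSON.stringify(b)
    ? { match: true as const }
    : { match: false as const, control: a, experiment: b };
}
```

A production-grade tool would also normalize non-deterministic fields (timestamps, request IDs) before comparing.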

Engineers can view diffs as flattened JSON nodes. The left side of the comma in the parentheses is the control value; the experiment value is on the right.

The top diff illustrates missing data for an ID field, and the second shows an encoding difference. The Replay Tester gave Netflix confidence that it had correctly replicated business logic in which subscriber plans and location data determine catalog availability.
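The flattened-diff format described above can be sketched as follows. This is a simplified illustration, not Netflix’s tooling; the field names are made up.

```typescript
// Flatten nested JSON into dot-separated leaf paths, then report any
// differing leaves as "(controlValue, experimentValue)" pairs.
function flatten(
  value: unknown,
  path = "",
  out: Record<string, unknown> = {},
): Record<string, unknown> {
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      flatten(child, path ? `${path}.${key}` : key, out);
    }
  } else {
    out[path] = value;
  }
  return out;
}

function diffPayloads(control: unknown, experiment: unknown): Record<string, string> {
  const a = flatten(control);
  const b = flatten(experiment);
  const diffs: Record<string, string> = {};
  for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
    if (a[key] !== b[key]) diffs[key] = `(${a[key]}, ${b[key]})`;
  }
  return diffs;
}
```

Flattening makes the report easy to scan: each differing leaf shows up once, keyed by its full path.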


Replay Testing benefits:

  • Replay Testing provided confidence that both GraphQL implementations are equal.
  • The test enabled tuning configs in cases where data was missing due to overeager timeouts.
  • It tested business logic that required many (unknown) inputs and where correctness can be hard to eyeball.


Replay Testing hindrances:

  • Replay Testing isn’t effective for personally identifiable information or non-idempotent APIs. It’s beneficial to have prevention mechanisms in place so those categories aren’t accidentally tested.
  • Correctness: What exactly is correctness? Is it more correct for an array to be empty or null, or is the difference just noise? Ultimately, Netflix matched existing behavior as much as possible because verifying the robustness of the client’s error handling was difficult.
  • You can’t test what you forget about. Netflix ended up with untested fields simply because the feature developers forgot about them.

Shortcomings, yes. But Netflix credits Replay Testing as a key indicator of when they achieved functional correctness of most idempotent queries.

Sticky Canary

Sticky Canary is similar to a covert AB Testing experiment. It’s an infrastructure experiment where customers are assigned either to a canary (experimental) host or a baseline host for the duration of the experiment. The goal is to discover the overall perceived health of user interaction on both the baseline and experimental hosts. Are users clicking play at the same rates? Are videos loading before the user loses interest? The Sticky Canary testing took place in phase two, making the baseline the GraphQL Shim and the canary the Video API Service.
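Persistent assignment is typically done with a stable hash of the customer ID, along these lines. This is a generic canary-allocation sketch under assumed parameters, not Netflix’s allocation service.

```typescript
// Deterministically assign each customer to the canary or baseline host group.
// The same customerId always hashes to the same bucket, which is what makes
// the canary "sticky" for the duration of the experiment.
function assignHost(customerId: string, canaryPercent: number): "canary" | "baseline" {
  let hash = 0;
  for (const ch of customerId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple stable string hash
  }
  return hash % 100 < canaryPercent ? "canary" : "baseline";
}
```

Stickiness matters because flip-flopping a customer between implementations mid-experiment would contaminate the perceived-health metrics on both sides.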

Netflix then collected and analyzed the performance of the two clusters. Some of the KPIs it closely monitored were:

  • Median and tail latencies
  • Error rates
  • Logs
  • Resource utilization — CPU, network traffic, memory, disk
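Median and tail latency from raw samples reduce to a percentile computation, sketched here. This is a simplification: at Netflix’s scale, streaming estimators are used rather than sorting every sample.

```typescript
// p50 of request latencies is the median; p99/p999 are the tail latencies.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((x, y) => x - y);
  // Nearest-rank method: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Comparing p99 between the canary and baseline clusters surfaces regressions that a healthy-looking median would hide.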

They started small: tiny customer allocations and hour-long experiments. After validating performance, they slowly built up scope to a higher percentage of customer allocations, multi-region tests and, eventually, day-long experiments. Validating along the way is essential, since Sticky Canaries impact live production traffic and are assigned persistently to a customer.

Several Sticky Canaries later, the metrics for phase two of the migration were strong enough that Netflix could dial up GraphQL globally.


Sticky Canary benefits:

  • Sticky Canary provided insight into latencies and resource utilization. This helped Netflix understand the changes in scaling profiles post-migration.
  • This test confirmed improved metrics after the switch to GraphQL.
  • Sticky Canary is compatible with mutating or non-idempotent APIs.


Sticky Canary hindrances:

  • There’s potential for negative customer impact since Sticky Canaries affect real users.
  • Sticky Canaries are meant as short-lived experiments. Netflix recommends AB tests for longer-lived experiments.


The original article said it best, “Technology is constantly changing, and we, as engineers, spend a large part of our careers performing migrations. The question is not whether we are migrating but whether we are migrating safely, with zero downtime, in a timely manner.” It seems like this time, the answer for Netflix is yes.
