Netflix Has an Exchange So Complex That it Has Triggered a Scientific Renaissance
Internet-scale architecture is reawakening developers across many disciplines to a startling realization: The answers to the multi-user, microservice architecture problems of the 2010s are often scaled-up versions of solutions conceived at mainframe scale as early as the 1960s.
Case in point: a mathematical technique for matrices called singular value decomposition (SVD), explained and depicted for its use in detecting physics anomalies in this 1976 film produced by Los Alamos National Labs. The technique applies a rigorous factorization to a matrix of values in order to derive a threshold. Establishing that threshold is the first step in identifying outliers: values in a time series that don’t appear to have been generated by the same inputs as most of the others.
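The idea can be sketched in a few lines of Python with NumPy. This is an illustrative toy, not the film’s or Netflix’s actual computation: a matrix of readings driven by one repeating pattern is factored with SVD, the singular values are thresholded to keep only the dominant structure, and whatever the resulting low-rank model cannot explain is flagged as an outlier candidate.

```python
import numpy as np

# Toy data: seven days of hourly readings driven by one daily pattern,
# so the matrix is approximately rank one. All values are illustrative.
rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0, 2 * np.pi, 24))
M = np.tile(pattern, (7, 1)) + 0.05 * rng.standard_normal((7, 24))
M[3, 10] += 5.0                      # inject one anomalous reading

# SVD factors M into U @ diag(s) @ Vt; the singular values in s rank
# how much of M each component explains.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only dominant components (a simple threshold on the singular
# values) to build a low-rank approximation of "normal" behavior.
k = int(np.sum(s > 0.9 * s[0]))
low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The entry the low-rank model explains worst is the outlier candidate.
residual = np.abs(M - low_rank)
print(np.unravel_index(residual.argmax(), residual.shape))
```

Here the injected spike at row 3, column 10 survives as the largest residual, while the repeating daily pattern is absorbed into the low-rank approximation.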
What’s more difficult than using SVD to diagnose the behavior of subatomic particles in a reaction chamber? How about isolating inaccurate payment data in a worldwide system of payment transactions that could accrue as many as 13,000 records per second? Think I’m talking about a stock exchange?
No, I’m talking about Netflix, the wildly popular streaming video service.
The Web’s Killer App
More than half of the world’s Web traffic is marshaled by content delivery networks (CDNs), according to Cisco’s Visual Networking Index. Some two-thirds of that amount appears to be generated by Netflix.
By even the most conservative accounts, Netflix is the single greatest consumer of bandwidth in the world.
Thus far, the Web hasn’t caved in or imploded. CDNs seem to be working as they were designed to. But here’s something you might never have considered, at least not in this depth: All Netflix traffic is being accounted for by financial transactions. A Netflix subscription is not just some gateway that opens up unlimited access to the world’s biggest video stream. The financial transactions for just the portion of Netflix’s 57 million worldwide subscribers that streams video each day generate, by the company’s estimate, over a billion separate records daily.
And that’s just for Netflix. With each financial transaction supported by at least three, often four, subsequent transactions between payment processing firms, banks, and their supporting banks, Netflix is arguably the most complex commercial exchange in history.
A payment system anomaly at Netflix scale is like a derailing train. When one of the service provider’s payment partners fails to transact with a customer, it subsequently fails to connect with hundreds more. The result is a measurable “churn” — a loss of customers that, for an ordinary department store, would be considered disastrous.
At a meetup produced by the San Francisco Bay Area chapter of the Association for Computing Machinery last January 26, Netflix lead data scientist Shankar Vedaraman explained how something deceptively characterized as an “anomaly” can lead to such troubles. He leads an analytics team whose mission is to reduce churn as much as possible, working directly with Netflix’s payment managers and financial planners.
“Basically, if a customer is not able to pay for Netflix and loses access to the Netflix service, we want to keep [churn] to a minimum. That’s our primary goal,” said Vedaraman. “We have a small set of folks working very closely with business users, and thereby are able to have a greater impact on the organization as a whole. But sometimes what happens is, we have a motivation for a solution for a specific vertical, but we take a step back and see if we can provide a scalable solution that can be used across different applications.”
In this case, the Netflix team came across a permutation of SVD originally conceived for the field of video surveillance — specifically, as a method of removing spurious shadows from captured faces to facilitate better facial recognition. It’s a concept called robust principal component analysis (RPCA), being driven by a Stanford University mathematics team led by Professor Emmanuel Candès.
The Blessing of Low Dimensionality
A 2009 research paper [PDF] explains the problem in general: “The recent explosion of massive amounts of high-dimensional data in science, engineering, and society presents a challenge as well as an opportunity to many areas such as image, video, multimedia processing, web relevancy data analysis, search, biomedical imaging and bioinformatics. In such application domains, data now routinely lie in thousands or even billions of dimensions, with a number of samples sometimes of the same order of magnitude.”
In statistics, dimensionality refers to the number of coordinates necessary to describe a data point. Applied to databases, you can think of this as the number of relevant properties necessary to identify a single point. Prof. Candès understood that data of the category he’s discussing “lie on some low-dimensional manifold,” meaning that it doesn’t need to be indexed by too many coordinates. Mathematicians can rely on this fact, he says, to generate matrices to which algorithms can be more robustly applied.
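A tiny NumPy example (with made-up numbers) makes the “low-dimensional manifold” point concrete: data can live in a space with many coordinates while the matrix holding it has very low rank, which is exactly the property that low-rank methods exploit.

```python
import numpy as np

# 1,000 "data points" recorded in 50 dimensions that secretly lie on a
# 2-dimensional subspace: each point is fully described by 2 coordinates.
rng = np.random.default_rng(1)
basis = rng.standard_normal((2, 50))     # a 2-D subspace inside R^50
coords = rng.standard_normal((1000, 2))  # 2 hidden coordinates per point
X = coords @ basis

# Although each row has 50 entries, the matrix has rank 2.
print(np.linalg.matrix_rank(X))
```

Because the data is really two-dimensional, a rank-2 model reconstructs it exactly; everything a low-rank model of real data fails to reconstruct becomes candidate noise or anomaly.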
Even then, Candès noted that his more robust concept was flexible, suggesting it could be applied to a certain video streaming service: specifically, to help it determine the movie tastes of its customers from an incomplete set of customers’ movie reviews. Netflix’s own Chris Coburn, a member of its product analytics team, cites Candès’ work in deriving a form of RPCA that can be applied not only to Netflix’s payment records but to other businesses as well.
The robust form of PCA greatly reduces the number of false positives. In a company blog post last Thursday, Coburn and his colleagues present an interactive demo of their real RPCA algorithm at work. Almost like the ’76 Los Alamos film, you can actually watch the algorithm generate a low-rank approximation of a seasonality wave, fitting itself as best it can to a series of oscillations — not unlike the ebbs and flows of Netflix customer viewing habits in a time series.
Once the low-rank estimated wave is generated, the points for which that wave does not account are considered the outliers. These outliers may then be assessed for potential payment processing events, with the analytics results projected onto a dashboard produced by Tableau. That dashboard can then be shared with payment processing partners, who aren’t necessarily data scientists, but who do need to see how customer events affect them as well.
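A simplified sketch of this decomposition, in the spirit of Candès’ Principal Component Pursuit, can be written in NumPy. To be clear, this is a teaching approximation of the published algorithm, not Netflix’s production code, and the test data below is invented: a rank-one “seasonality wave” with one corrupted reading standing in for a payment anomaly.

```python
import numpy as np

def rpca(M, n_iter=300):
    """Split M into a low-rank part L plus a sparse part S via
    Principal Component Pursuit (inexact augmented Lagrangian),
    following the general recipe from Candes et al. A teaching
    sketch, not a production implementation."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))        # standard sparsity weight
    mu = m * n / (4.0 * np.abs(M).sum())  # common step-size heuristic
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # dual variable
    for _ in range(n_iter):
        # Singular-value thresholding recovers the low-rank part.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Entry-wise soft thresholding recovers the sparse outliers.
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y += mu * (M - L - S)
    return L, S

# Ten identical daily cycles form a rank-one "seasonality wave";
# one corrupted reading plays the role of a payment anomaly.
t = np.linspace(0, 2 * np.pi, 96)
M = np.tile(np.sin(t), (10, 1))
M[4, 30] += 6.0
L, S = rpca(M)
print(np.unravel_index(np.abs(S).argmax(), S.shape))
```

The low-rank factor L absorbs the oscillating wave, while the sparse factor S isolates the single corrupted entry, which is the candidate the analysts would then assess as a potential payment processing event.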
What’s more, the behavior of these detected outliers may start to form a pattern, which could conceivably enable partners to detect payment system anomalies before they occur.
“Today Netflix customers sign up across the world on hundreds of different types of browsers or devices,” the team writes. “Identifying anomalies across unique combinations of country, browser/device and language helps our engineers understand and react to customer sign up problems in a timely manner.”
When monolithic software ruled the data center, processing errors were treated as necessary evils. But monoliths could only scale so far. Now that they’re being replaced with microservices, developers are reclaiming their roles as engineers and mathematicians, rediscovering the tools of the trade the way mainframe developers of the ’70s had foreseen them.