
Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All

The streaming service provider made waves when its engineers reported they had refactored their QoS monitor for a monolithic architecture. Microservices experts evaluating the details discovered they actually did just the opposite.
Jun 13th, 2023 6:00am by
Feature image of Foucault’s Pendulum from the National Museum of Science and Technology in Milan by Ben Ostrowsky, licensed under Creative Commons 2.0.

In any organizational structure, once you break down regular jobs into overly granularized tasks and delegate them to too many individuals, their messaging soon becomes unmanageable, and the organization stops growing.

Last March 22, in a blog post that went unnoticed for several weeks, Amazon Prime Video’s engineers reported the service quality monitoring application they had originally built to determine quality-of-service (QoS) levels for streaming videos — an application they built on a microservices platform — was failing, even at levels below 10 percent of service capacity.

What’s more, they had already applied a remedy: a solution their post described as “a monolith application.”

The change came at least five years after Prime Video — home of on-demand favorites such as “Game of Thrones” and “The Marvelous Mrs. Maisel” — successfully outbid traditional broadcast outlets for the live-streaming rights to carry NFL Thursday Night Football.

One of the leaders in on-demand streaming now found itself in the broadcasting business, serving an average 16.6 million real-time viewers simultaneously. To keep up with live sports viewers’ expectations of their “networks” — in this case, CBS, NBC, or Fox — Prime Video’s evolution needed to accelerate.

It wasn’t happening. When the 2022 football season kicked off last September, too many of Prime Video’s tweets were prefaced with the phrase, “We’re sorry for the inconvenience.”

Prime Video engineers overcame these glitches, the engineers’ blog reported, by consolidating QoS monitoring operations that had been separated into isolated AWS Step Functions and Lambda functions, into a unified code module.
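To make the difference concrete, here is a minimal sketch in Python of the same three-stage check run two ways: as separately invoked functions that serialize every hand-off, and as one module that passes data in memory. The stage names, the threshold, and the sample values are hypothetical; the engineers' post does not publish the actual code, and JSON round trips here merely stand in for calls between services.

```python
import json

# Hypothetical stages; none of these names come from Prime Video's code.
def split_frames(stream):
    """Treat the stream as a list of per-frame quality scores."""
    return list(stream)

def detect_defects(frames, threshold=10):
    """Return the indexes of frames scoring below a made-up threshold."""
    return [i for i, score in enumerate(frames) if score < threshold]

def summarize(defects, total):
    """Build a small report of defective frames versus total frames."""
    return {"defective": len(defects), "total": total}

def run_as_separate_functions(stream):
    """Every hand-off between stages is serialized, as it would be between services."""
    hops = 0

    def hand_off(data):
        nonlocal hops
        hops += 1
        return json.loads(json.dumps(data))  # stand-in for a network round trip

    frames = split_frames(hand_off(stream))
    defects = detect_defects(hand_off(frames))
    report = summarize(hand_off(defects), len(frames))
    return report, hops

def run_as_one_module(stream):
    """The same stages called in-process, with no serialized hand-offs."""
    frames = split_frames(stream)
    defects = detect_defects(frames)
    return summarize(defects, len(frames)), 0
```

Both paths produce the same report; only the number of serialized hand-offs differs, which is the overhead the consolidation was meant to remove.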

As initially reported, their results appeared to finally confirm many organizations’ suspicions, well-articulated over the last decade, that the costs incurred in maintaining system complexity and messaging overhead inevitably outweighed any benefits to be realized from having adopted microservices architecture.

Once that blog post awakened from its dormancy, several experts declared all of microservices architecture dead. “It’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system,” wrote Ruby on Rails creator David Heinemeier Hansson. “Are we seeing a resurgence of the majestic monolith?” asked .NET MVP Milan Jovanović on Twitter. “I hope so.”

“That’s great news for Amazon because it will save a ton of money,” declared Jeff Delaney on his enormously popular YouTube channel Fireship, “but bad news for Amazon because it just lost a great revenue source.”

Yet there were other experts, including CodeOpinion.com’s Derek Comartin, who compared Prime’s “before” and “after” architectural diagrams with one another, and noticed some glaring disconnects between those diagrams and their accompanying narrative.

As world-class experts speaking with The New Stack also noticed, and as a high-ranking Amazon Web Services engineer finally confirmed for us, the solution Prime Video adopted does not fit the profile of a monolithic application. In every respect that truly matters, including scalability and functionality, it is a more evolved microservice than what Prime Video had before.

That Dear Perfection

“This definitely isn’t a microservices-to-monolith story,” remarked Adrian Cockcroft, the former vice president of cloud architecture strategy at AWS, now an advisor for Nubank, in an interview with The New Stack. “It’s a Step Functions-to-microservices story. And I think one of the problems is the wrong labeling.”

Cockcroft, as many regular New Stack readers will know, is one of microservices architecture’s originators, and certainly its most outspoken champion. He has not been directly involved with Prime Video or AWS since becoming an advisor, but he’s familiar with what actually happened there, and he was an AWS executive when Prime’s stream quality monitoring project began. He described for us a kind of prototyping strategy where an organization utilizes AWS Step Functions, coupled with serverless orchestration, for visually modeling business processes.

With this adoption strategy, an architect can reorganize digital processes essentially at will, eventually discovering their best alignment with business processes. He’s intimately familiar with this methodology because it’s part of AWS’ best practices — advice which he himself co-authored. Speaking with us, Cockcroft praised the Prime Video team for having followed that advice.

As Cockcroft understands it, Step Functions was never intended to run processes at the scale of live NFL sports events. It’s not a staging system for processes whose eventual, production-ready state would need to become more algorithmic, more efficient, more consolidated. So the trick to making the Step Functions model workable for more than just prototyping is not just to make the model somewhat scalable, but also transitional.

“If you know you’re going to eventually do it at some scale,” said Cockcroft, “you may build it differently in the first place. So the question is, do you know how to do the thing, and do you know the scale you’re going to run it at? Those are two separate cases. If you don’t know either of those, or if you know it’s small-scale, complex, and you’re not exactly sure how it’s going to be built, then you want to build a prototype that’s going to be very fast to build.”

However, he suggested, if an organization knows from the outset its application will be very widely deployed and highly scalable, it should optimize for that situation by investing in more development time up-front. The Prime Video team did not have that luxury. In that case, Cockcroft said, the team was following best practices: building the best system they could, to accomplish the business objectives as they interpreted them at the time.

“A lot of workloads cost more to build than to run,” Cockcroft explained. “[For] a lot of internal corporate IT workloads, lots of things that are relatively small-scale, if you’re spending more on the developers than you are on the execution, then you want to optimize for saving developer time by building it super-quickly. And I think the first version… was optimized that way; it wasn’t intended to run at scale.”
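Cockcroft's build-versus-run argument can be put into rough arithmetic. The Python sketch below is illustrative only: the function names and every dollar figure in the usage note are hypothetical, chosen to show how the cheaper option flips once running costs grow with scale.

```python
def total_cost(build_cost, monthly_run_cost, months):
    """Cost of building a system once and then running it for a number of months."""
    return build_cost + monthly_run_cost * months

def cheaper_option(quick, optimized, months):
    """Compare two (build_cost, monthly_run_cost) pairs over the same period.

    Returns "quick" or "optimized", whichever has the lower total cost;
    ties go to the quick build, since it ships sooner.
    """
    quick_total = total_cost(quick[0], quick[1], months)
    optimized_total = total_cost(optimized[0], optimized[1], months)
    return "quick" if quick_total <= optimized_total else "optimized"
```

With made-up numbers, a quick build costing 20,000 up front and 1,000 a month beats an optimized build costing 80,000 and 200 a month over a year. Raise the run costs to 50,000 and 5,000 a month, as might happen at live-broadcast scale, and the optimized build wins.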

As any Step Functions-based system becomes refined, according to those same best practices, the next stage of its evolution will be transitional. Part of that metamorphosis may involve, contrary to popular notions, service consolidation. Despite how Prime Video’s blog post described it, the result of consolidation is not a monolith. It’s now a fully-fledged microservice, capable of delivering those 90% cost reductions engineers touted.

“This is an independently scalable chunk of the overall Prime Video workload,” described Cockcroft. “If they’re not running a live stream at the moment, it would scale down or turn off — which is one reason to build it with Step Functions and Lambda functions to start with. And if there’s a live stream running, it scales up. That’s a microservice. The rest of Prime Video scales independently.”
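The scaling behavior Cockcroft describes, off when nothing is live and growing with the number of live streams, can be sketched as a single rule. This is a simplified illustration in Python, not how AWS or Prime Video computes capacity; the function name and the per-instance capacity of 100 streams are assumptions.

```python
import math

def desired_instances(active_streams, streams_per_instance=100):
    """Return how many monitor instances to run for the current live load.

    With no live streams the service scales to zero; otherwise it rounds up
    so that every stream is covered. The capacity figure is hypothetical.
    """
    if active_streams <= 0:
        return 0
    return math.ceil(active_streams / streams_per_instance)
```

Because the rule depends only on this one workload's load, the monitor can grow or shrink without touching the rest of the system, which is the independence Cockcroft points to.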

Following the publication of this article, an AWS spokesperson contacted The New Stack offering further advice on how Step Functions may be put to use within organizations. Many AWS customers, including Liberty Mutual and Taco Bell, the spokesperson told us, begin their architectural plan with Step Functions, and have chosen to stick with it as their deployments scale up and out. The Prime Video stream QoS service that was the topic of the original Prime blog post, the spokesperson asserted, is one of many services the streamer utilizes on the AWS platform, and many of those others may continue to use Step Functions for the foreseeable future.

The New Stack spoke with Ajay Nair, AWS’ general manager for Lambda and for its managed container service App Runner. Nair confirmed Cockcroft’s account in its entirety for how the project was initially framed in Step Functions, as well as how it ended up a scalable microservice.

Nair outlined for us a typical microservices development pattern. Here, the original application’s business processes may be too rigidly coupled together to allow for evolution and adaptation. So they’re decoupled and isolated. This decomposition enables developers to define the contracts that spell out each service’s expected inputs and outputs, requirements and outcomes. For the first time, business teams can directly observe the transactional activities that, in the application’s prior incarnations, had been entirely obscured by its complexity and unintended design constraints.
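A contract of the kind Nair describes can be sketched as typed inputs and outputs plus an interface that any implementation must satisfy. The Python below is a hypothetical illustration: the type names, fields, and threshold are invented for this example and do not come from AWS or Prime Video.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple

@dataclass(frozen=True)
class QualityRequest:
    """Input side of the contract: which stream, and its per-frame scores."""
    stream_id: str
    frame_scores: Tuple[int, ...]

@dataclass(frozen=True)
class QualityReport:
    """Output side of the contract: how many frames were judged defective."""
    stream_id: str
    defective_frames: int

class QualityMonitor(Protocol):
    """Anything with this method honors the contract, however it is built."""
    def check(self, request: QualityRequest) -> QualityReport: ...

class ConsolidatedMonitor:
    """One possible implementation that does all the work in a single module."""
    def __init__(self, threshold: int = 10):
        self.threshold = threshold  # made-up quality cutoff

    def check(self, request: QualityRequest) -> QualityReport:
        bad = sum(1 for score in request.frame_scores if score < self.threshold)
        return QualityReport(request.stream_id, bad)

def report_for(monitor: QualityMonitor, request: QualityRequest) -> QualityReport:
    """Callers depend only on the contract, not on how the monitor is deployed."""
    return monitor.check(request)
```

Callers such as `report_for` see only the request and report types, so the work behind `check` can be split apart or merged together without changing anything on the calling side.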

From there, Nair went on, software engineers may codify the isolated serverless functions as services. In so doing, they may further decompose some services — as AWS did for Amazon S3, which is now served by over 300 microservice classes. They may also consolidate other services. One possible reason: Observing their behavior may reveal they actually did not need to be scaled independently after all.

“It is a natural evolution of any architecture where services that are built get consolidated and redistributed,” said Nair. “The resulting capability still has a well-established contract, [and] has a single team managing and deploying it. So it technically meets the definition of a microservice.”

Breakdown

“I think the definition of a microservice is not necessarily crisp,” stated Brendan Burns, the co-creator of Kubernetes, now corporate vice president at Microsoft, in a note to The New Stack.

“I tend to think of it more in terms of capabilities around functionality, scaling, and team size,” Burns continued. “A microservice should be a consistent function or functions — this is like good object-oriented design. If your microservice is the CatAndDog() service, you might want to consider breaking that into Cat() and Dog() services. But if your microservice is ThatOneCatOnMyBlock(), it might be a sign that you have broken things down too far.”

“The level of granularity that you decompose to,” explained F5 Networks Distinguished Engineer Lori MacVittie, speaking with The New Stack, “is still limited by the laws of physics, by network speed, by how much [code] you’re actually wrapping around. Could you do it? Could you do everything as functions inside a containerized environment, and make it work? Yes. It’d be slow as heck. People would not use it.”

Adrian Cockcroft advises that the interpretability of each service’s core purpose, even by a non-developer, should be a tenet of microservice architecture itself. That fact alone should militate against poor design choices.

“It should be simple enough for one person to understand how it works,” Cockcroft advocated. “There are lots of definitions of microservices, but basically, you’ve partitioned your problem into multiple, independent chunks that are scaled independently.”

“Everything we’re describing,” remarked F5’s MacVittie, “is just SOA without the standards… We’re doing the same thing; it’s the same pattern. You can take a look at the frameworks, objects, and hierarchies, and you’d be like, ‘This is not that much different than what we’ve been doing since we started this.’ We can argue about that. Who wins? Does it matter? Is Amazon going to say, ‘You’re right, that’s a big microservice, thank you?’ Does it change anything? No. They have solved a problem that they had, by changing how they design things. If they happen to stumble on what they should have been doing in the first place, according to the experts on the Internet, great. It worked for them. They’re saving money, and they did expose one of those problems with decomposing something too far, on a set of networks on the Internet that is not designed to handle it yet.

“We are kinda stuck by physics, right?” she continued.  “We’re unlikely to get any faster than we are right now, so we have to work around that.”

Perhaps you’ve noticed: Enterprise technology stories thrive on dichotomy. For any software architecture to be introduced to the reader as something of value, vendors and journalists frame it in opposition to some other architecture. When an equivalent system or methodology doesn’t yet exist, the new architecture may end up being portrayed as the harbinger of a revolution that overturns tradition.

One reason may be that the discussion online is being led either by vendors, or by journalists who tend to speak with vendors first.

“There is this ongoing disconnect between how software companies operate, and how the rest of the world operates,” remarked Platify Insights analyst Donnie Berkholz. “In a software company, you’ve got ten times the staffing and software engineering on a per capita basis across the company, as you do in many other companies. That gives you a lot of capacity and talent to do things that other people can’t keep up with.”

Maybe the big blazing “Amazon” brand obscured the fact — despite the business units’ proximity to one another — that Prime Video was a customer of AWS. With its engineers’ blog post, Prime joined an ongoing narrative that may have already spun out of control. Certain writers may have focused so intently upon selected facets of microservices architecture, that they let readers draw their own conclusions about what the alternatives to that architecture must look like. If microservices were, by definition, small (an aspect that one journalist in particular was guilty as hell of over-emphasizing), its evil counterpart must be big, or bigness itself.

Subsequently, in a similar confusion of scale, if Amazon Prime Video embraces a monolith, so must all of Amazon. Score one come-from-behind touchdown for monoliths in the fourth quarter, and cue the Thursday Night Football theme.

“We’ve seen the same thing happening over and over across the years,” mentioned Berkholz. “The leading-edge software companies, web companies, and startups encounter a problem because they’re operating at a different scale than most other companies. And a few years later, that problem starts to hit the masses.”

Buildup

The original “axis of evil” in the service-orientation dichotomy was 1999’s Big Ball of Mud. First put forth by Professors Brian Foote and Joseph Yoder of the University of Illinois at Urbana-Champaign, the Big Ball helped catalyze a resurgence in support for distributed systems architecture. It was seated at the discussion table where the monolith sits now, but not for the same reasons.

The Big Ball wasn’t a daunting tower of rigid, inflexible, tightly-coupled processes, but rather programs haphazardly heaped onto other programs, with data exchanged between them by means of file dumps onto floppy disks carried down office staircases in cardboard boxes. Amid the digital chaos of the 1990s and early 2000s, anything definable as not a Big Ball of Mud, was already halfway beautiful.

“Service Oriented Architecture was actually the same idea as microservices,” recalls Forrester senior analyst David Mooter. “The idea was, you create services that align with your business capabilities and your business operating model. Most organizations, what they heard was, ‘Just put stuff [places] and do a Web service,’ [the result being] you just make things SOAP. And when you create haphazard SOAP, you create Distributed Little Balls of Mud. SOA got a bad name because everyone was employing SOA worst practices.”

Mooter shared some of his latest opinions in a Forrester blog post entitled, “The Death of Microservices?” In an interview with us, he noted, “I think you’re seeing, with some of the reaction to this Amazon blog, when you do microservices worst practices, and you blame microservices rather than your poor architectural decisions, then everyone says microservices stink… Put aside microservices: Any buzzword tech trend cannot compensate for poor architectural decisions.”

The sheer fact that “Big Ball” is a nebulous, plastic metaphor has enabled almost any methodology or architecture that fell out of favor over the past quarter-century, to become associated with it. When microservices makes inroads with organizations, it’s the monolith that gets to wear the crown of thorns. More recently, with some clever phraseology, microservices has carried the moniker of shame.

“Our industry swings like a pendulum between innovation, experimentation, and growth (sometimes just called ‘peacetime’) and belt-tightening and pushing for efficiency (‘wartime’),” stated Laura Tacho, long-time friend of The New Stack, and a professional engineering coach.  “Of course, most companies have both scenarios going on in different pockets, but it’s obvious that we’re in a period of belt-tightening now. This is when some of those choices — for example, breaking things into microservices — can no longer be justified against the efficiency losses.”

Berkholz has been observing the same trend: “There’s been this push back-and-forth within the industry — some sort of a pendulum happening, from monolith to microservices and back again. Years ago, it was SOA and back again.”

Defenders of microservices, facing the mud-throwing that happens when the pendulum swings back, say their architecture won’t be right for every case, or even every organization. That’s a problem. Whenever a market is perceived as being served by two or more equivalent, competing solutions, that market may correctly be portrayed as fragmented, which is exactly the kind of market enterprises typically avoid participating in.

“Fragmentation implies that the problem hasn’t been well-solved for everybody yet,” Berkholz told us, “when there’s a lot of different solutions, and nobody’s consolidated on a single one that makes sense most of the time. That is something that companies watch. Is this a fragmented ecosystem, where it’s hard to make choices? Or is this an ecosystem where there’s a clear and obvious master?”

From time to time, Lori MacVittie told us, F5 Networks surveys its clients, asking them for the relative percentages of their application portfolios they would describe as monoliths, microservices, mobile apps and middleware-infused client/server apps. “Most organizations were operating at some percentage of each of those,” she told us. When the question was adjusted to ask only whether their apps were “traditional” or “modern,” the split was usually 60/40, respectively.

“They’re doing both,” she said. “And within those, they’re doing different styles. Is that a mess? I don’t think so. They had specific uses for them.”

“I kind of feel like microservice-vs.-monolith isn’t a great argument,” stated Microsoft’s Brendan Burns. “It’s like arguing about vectors vs. linked lists or garbage collection vs. memory management. These designs are all tools — what’s important is to understand the value that you get from each, and when you can take advantage of that value. If you insist on microservicing everything, you’re definitely going to microservice some monoliths that probably you should have just left alone. But if you say, ‘We don’t do microservices,’ you’re probably leaving some agility, reliability and efficiency on the table.”

The Big Ball of Mud metaphor’s creators cited, as the reason software architectures become bloated and unwieldy, Conway’s Law: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” Advocates of microservices over the years have taken this notion a few steps further, suggesting business structures and even org charts should be deliberately remodeled to align with software, systems, and services.

When the proverbial pendulum swings back, notes Tacho, companies start reconsidering this notion. “Perhaps it’s not only Conway’s Law coming home to roost,” she told us, “but also, ‘Do market conditions allow us to take a gamble on ignoring Conway’s Law for the time being, so we could trade efficiency for innovation?’”

Continuing her war-and-peace metaphor, Tacho went on: “Everything’s a tradeoff. Past decisions to potentially slow development down and make processes less efficient due to microservices might have been totally fine during peacetime, but having to continuously justify those inefficiencies, especially during a period of belt-tightening, is tiresome. What surprises me sometimes is that rearchitecting a large codebase is not something that most companies would invest in during wartime. They simply have to have other priorities with a better ROI for the business, but big fish like Amazon have more flexibility.”

“The first thing you should look at is your business,” advised Forrester’s Mooter, “and what is the right architecture for that? Don’t start with microservices. Start with, what are the business outcomes you’re trying to achieve? What Forrester calls, ‘Outcome-Driven Architecture.’ How do we align our IT systems and infrastructure and applications, to optimize your ability to deliver that? It will change over time.”

“It’s definitely the case,” remarked Microsoft’s Burns, “that one of the benefits of microservices design is that it enables small teams to behave autonomously because they own very specific APIs with crisp contracts between teams. If the rest of your development culture prevents your small teams from operating autonomously, then you’re never going to gain the agility benefits of microservices. Of course, there are other benefits too, like increased resiliency and potentially improved efficiency from more optimal scaling. It’s not an all-or-nothing, but it’s also the case that an engineering culture that is structured for independence and autonomy is going to do better when implementing microservices. I don’t think that this is that much different than the cultural changes that were associated with the DevOps movement a decade ago.”

Prime Video made a huge business gamble on NFL football rights, and the jury is still out as to whether, over time, that gamble will pay off. That move lit a fire under certain sensitive regions of Prime Video’s engineering team. The capabilities they may have planned to deliver three to five years hence, were suddenly needed now. So they made an architectural shift — perhaps the one they’d planned on anyway, or maybe an adaptation. Did they enable business flexibility down the road, as their best practices advised? Or have they just tied Prime Video down to a service contract, to which their business will be forced to adapt forever? Viewed from that perspective, one could easily forget which option was the monolith, and which was the microservice.

It’s a dilemma we put to AWS’ Ajay Nair, and his response bears close scrutiny, not just by software engineers: “Building an evolvable architectural software system is a strategy, not a religion.”

Update: Since publication, this story has been updated with additional material from AWS around Step Functions.
