Red Hat sponsored this post.
After updating to OpenShift 4.3.19, Quay.io experienced intermittent service interruptions. The team quickly rolled back to 4.3.18, restoring service and steadying the waters, but everyone involved was now taking part in a murder mystery.
You’ve heard stories, but if you’re lucky, you’ve never experienced it. The bug is below you. It’s above you. It’s in the walls. It’s listening to us right now.
Troubleshooting and debugging are time-honored traditions of the methodical and systematic elimination of possibilities. But what happens if you cannot rule out a portion of the stack because your team does not have deep knowledge of it? Or worse yet, what if one of the layers of your stack is closed source software?
What if, horror of horrors, your stack is entirely open source and the bug is down in one of those layers? In Kubernetes? In Linux? Can your teams even begin to comprehend tracking down that type of bug? Can they even eliminate it as a possibility without reading hundreds of pages of code and documentation?
Growing to 10-Digit Scale
Red Hat’s Quay.io is a very large hosted service. There’s been a lot of news recently about the business of hosting container images at scale for enterprise cloud users, and Quay.io has quietly been performing that function since 2013 and growing steadily. In the month of August 2020 alone, Quay.io served 1 billion container pulls and had 100% uptime.
Back in 2014, when Quay.io was acquired by CoreOS, a decision was made to build an App Registry into the service. This predated the modern methods of cloud native artifact bundling that we use today in Kubernetes, with solutions like OCI, but the functionality was nonetheless included in Quay’s codebase. Because this feature wasn’t what most users adopted Quay.io to do, it wasn’t highly used and so it didn’t get a lot of engineering scrutiny.
App Registry is a lesser-known feature of Quay.io that allows objects like Helm charts and containers with rich metadata to be stored. While most Quay.io customers don’t use this feature, Red Hat OpenShift is a large user. The OperatorHub within OpenShift uses App Registry to host all of its Operators.
Every OpenShift 4 cluster uses Operators from the embedded OperatorHub to serve a catalog of available Operators, and to install and update Operators that are already installed. As OpenShift 4 adoption has increased, so has the number of clusters globally. Each one of those clusters needs to download Operator content to run the embedded OperatorHub, using the App Registry inside Quay.io as a backend.
Fast forward to this summer and Quay.io is processing over one billion image requests per month, a rate of over 1.5 million per hour. It’s a large scale data distribution and retention service depended upon by enterprises around the globe. It’s also hosted on Red Hat OpenShift, an open hybrid cloud platform for container-based IT teams around the world.
After updating from OpenShift 4.3.18 to OpenShift 4.3.19, Quay.io’s database froze and the service was intermittently disrupted. During these periods, users experienced a range of outcomes, including slow container image access and an inability to retrieve container images. The team quickly rolled back to 4.3.18, restoring service and steadying the waters, but everyone involved was now taking part in a murder mystery as their very own Inspector Lynley.
But the culprit has already been mentioned: the app registry. Turns out it had become the way internal teams at Red Hat were building Kubernetes Operators. The code behind app registry had never been pushed to work at this scale, and thus, the entire system suffered because of it.
We’re not here to discuss the end results: they’re almost boring compared to the giant bug hunt which ensued, and which shows just how CSI-style procedural such a search can get when Red Hat is involved.
Instead, we’re here to discuss that bug hunt. The twists and turns, the insane breadth of possibilities, and the methods used to track it down. In the weeks after the crash, Red Hat employees working on Quay, OpenShift, the Linux kernel, and all manner of other systems attempted to eliminate possibilities and identify the exact culprit.
William Dettelback is an engineering manager on the Quay engineering team. When it came to the Quay.io outage, the first thing he saw was the Red Hat SRE team, run by Jay Ferrandini and Jonathan Beakley, isolate the changes that had taken place between the service functioning properly and its newly degraded state.
Dettelback says it’s important to have this type of monitoring and performance measurement in place to start; otherwise, when things go sideways, you cannot actually tell. Without a baseline of system behavior, pinpointing when exactly the problem started is nigh impossible.
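To make the idea of a baseline concrete, here is a minimal Python sketch of how a stored baseline can be used to spot when a metric first departs from normal behavior. It is not the Red Hat SRE team's tooling: the function name `first_deviation`, the three-standard-deviation threshold, and the sample values are all assumptions chosen for illustration.

```python
import statistics


def first_deviation(baseline, samples, threshold=3.0):
    """Return the index of the first sample that exceeds the baseline mean
    by more than `threshold` standard deviations, or None if none does.

    `baseline` needs at least two values recorded while the service was healthy.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    limit = mean + threshold * stdev  # hypothetical alerting rule
    for index, value in enumerate(samples):
        if value > limit:
            return index
    return None
```

Given request latencies recorded while the service was healthy, calling `first_deviation(healthy_latencies, recent_latencies)` points at the first recent sample that falls outside the normal range. Without the healthy series there is nothing to compare against, which is the point Dettelback makes.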
A Mile Wide, an Inch Deep
Fortunately, the number of changes across the systems involved was minimal. Unfortunately, they went deep. The OpenShift 4.3.18 to OpenShift 4.3.19 upgrade included not only OpenShift updates, but also some updates to the fundamental Linux systems and kernel used to power containers.
That’s because the OpenShift platform is not just some PaaS, or some framework, or even simply some implementation of Kubernetes. Instead, it is a harmonizing of thousands of open source projects, from the very bottom at the Linux kernel all the way up to the support for serverless applications running on top with Knative. Red Hat engineers have first-hand expertise across the entire open source stack.
In OpenShift 4, the Linux operating system is delivered as a feature of the platform through Red Hat Enterprise Linux CoreOS. Each instance of this OS is provisioned and updated by Kubernetes itself, using the Kubernetes declarative API machine controllers as part of the OpenShift installer. The entire stack embraces the concepts of fully immutable infrastructure.
Red Hat engineers were able to quickly narrow down what had changed in the kernel to just a few networking packages. It turned out that those amounted to only a few commits’ worth of changes, and Dettelback said the team was able to learn this within a day, rather than spending their time researching the vagaries of the Linux kernel.
Stephen Cuppett, director of engineering for Red Hat OpenShift, said that the Quay team, the OpenShift teams, and the Linux teams all tried to root out possible causes, narrowing the problem space as quickly as possible. But that wasn’t as easy as it sounded, as the problem only manifested at tremendous scale, making replication difficult in the lab.
Compounding matters, the Telemeter service, a remote debugging data stream, had been experiencing network-based outages after the 4.3.18 to 4.3.19 update, so both the Quay.io and Telemeter teams were initially convinced that they were tracking down the same bug.
“As a macro-level service failure,” said Dettelback, “there were a lot of avenues to chase down. We had application things to chase down, infrastructure things to chase down, we had OpenShift things, then the RHEL side of things for this. We knew we had a small number of deltas we were dealing with. After quite a bit of investigation, we figured out that the telemeter issues were networking related, but [it was] not the same issue Quay saw.”
This is when proper logging of performance metrics became important. When the outage occurred, the clusters’ performance metrics were captured and saved using a synthetic benchmark test against a smaller version of Quay in the staging environment. Since the bug was nearly impossible to reproduce in the lab, this data would be a lifeline to figuring out the cause. The team couldn’t simply spin the updated version of Quay.io back up and wait for it to fail again, as that would interrupt services for users who had built critical systems based on Quay.
Thus, the data from the initial issue conditions was critical to troubleshooting. Said Dettelback, “We found Quay on OpenShift 4.3.18 versus 4.3.19 behaved very differently at that breaking point. That was the clue. We knew 4.3.19 wasn’t the smoking gun, but it was the thing we were concerned about. It didn’t explain why we went down, but we knew when we had to do the upgrade, [that] we had to be careful.”
The Usual Suspects
At first, the backup system was suspected as the cause, as the database had been running backup calls prior to the outage. That turned out not to be the case, however, which closed off an entire avenue of possibility and narrowed the list of suspects.
Unfortunately, the initial list of suspects was as long as one in an Agatha Christie novel. Cuppett said, “We have different teams at all levels of the stacks, so none of my folks had to investigate all of them. It could have been a very protracted path. It’s complicated, this crosses skills boundaries. From Web services on Quay in Python, to Kubernetes in Go, to the Linux Kernel in C, and then there’s networking… These are all different teams that have multiple engineers at Red Hat.”
That means, said Cuppett, “We had gone wide across the different layers with multiple teams. That way, when one team found conclusive evidence, other teams could quickly abandon other costly and deep paths of investigation. And there were plenty of wrong roads to choose from, so narrowing it down across the many teams helped to prevent any one team from wasting their time, or blocking the other investigations.”
In the end, the problem stemmed from the increased demand on the App Registry in Quay, a lesser-used feature that had never been tested at that scale and that was seeing unexpectedly increased usage from development teams over time. That underlying App Registry code has since been optimized and the teams using those features are also being accommodated in other ways, reducing demand.
Said Dettelback, “The correct solution was multiple factors: it wasn’t one thing that took down Quay.io, it was a lot of traffic on a fairly vulnerable portion of Quay’s codebase that wasn’t designed to take the load it was taking. At a technical level, DNS resolution was slower on 4.3.19, but the way we determined that was that the team was able to build a reproducer in Python.”
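The article does not describe the team's reproducer, so the following is only a sketch of what timing DNS resolution from Python can look like, not the script Red Hat wrote. The function names `time_lookups` and `summarize`, the port, and the attempt count are hypothetical. The one real API it relies on is the standard library's `socket.getaddrinfo`, which is swappable here so the timing logic can be exercised without a network.

```python
import math
import socket
import statistics
import time


def time_lookups(hostname, attempts=50, resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and return per-lookup latency in
    milliseconds. Failed lookups are recorded as None."""
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname, 443)
        except OSError:  # socket.gaierror is a subclass of OSError
            latencies.append(None)
            continue
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to a few numbers that are easy to compare
    between two environments."""
    ok = sorted(value for value in latencies if value is not None)
    failures = len(latencies) - len(ok)
    if not ok:
        return {"count": 0, "failures": failures}
    p95_index = max(0, math.ceil(0.95 * len(ok)) - 1)  # nearest-rank percentile
    return {
        "count": len(ok),
        "failures": failures,
        "median_ms": statistics.median(ok),
        "p95_ms": ok[p95_index],
        "max_ms": ok[-1],
    }
```

Running `summarize(time_lookups("quay.io"))` from a pod on each cluster version and comparing the two summaries is one way to turn "DNS feels slower" into a measurable difference. Resolver caching on the host can hide the effect, so results from a sketch like this should be read with that in mind.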
The Abyss, Avoided
This could have been an endless dive into every avenue of possible issues. While the idea of possibly coming up against a bug that’s in the Linux kernel might sound like being knighted as a new open source warrior, is that really what your developers should be spending their time on if they’ve never touched the kernel before? This is the expertise of Red Hat engineers, and their work on issues like this is one of the advantages of Red Hat support. And if your teams really do encounter a kernel bug and want to take on the challenge of fixing it, we’ll help them do just that. We love bringing new contributors to open source!
“The kernel was one of those things where it looked very likely. ‘Oh, there was a kernel change! That could have had an upward effect on the stack.’ But we ruled that out very quickly,” said Dettelback.
So what’s the long-term solution? “I’d say the long-term solution is not the removal of app registry (that’s a tactical fix), but continuing to strengthen our cross-team collaboration across SRE, OCP and RHEL so we can fix these sorts of things faster. Because we have experts across the value chain and we work in an open manner, it’s easy to get the right people looking at the problem when you suspect it may be in their backyard. If we were a closed source shop or a less open organization, it would have been nearly impossible to get the collaboration and insight into chasing down what was going on when quay.io went down,” said Dettelback.