IBM OpenStack Engineer Urges Augmenting Jenkins with Zuul for Hyperscale Projects

When you’re managing a hyperscale project with thousands of contributors across dozens of nations, and with more than a handful of active, independently developed components, how do you expect to reliably pull off the feat of managing the project’s evolution as a single stream? Of course, you might be asking yourself, just how many real-world projects fit that description besides OpenStack itself? Yet you may recall there was a time not too many months ago when Kubernetes was being described as a product of hyperscale architecture. Perhaps this morning’s hyperscale advances are destined to be this evening’s mainstream commodities.
During the regular Technical Oversight Committee meeting of the Cloud Native Computing Foundation on Wednesday morning, Clint Byrum, a cloud architect with IBM (until September 2015 with HPE) and a contributor to OpenStack, was invited to present his case for how the CNCF could improve its continuous integration processes using a system of automated gatekeepers.
Such a system would prevent CI pipelines from being opened, and merge processes from taking place, until specific conditions have been met. But while the gates are closed, it could actually enable some testing operations to proceed in advance of schedule, while withholding any merges resulting from those tests until the gates are open.
How We Do Things Downtown
It’s a system Byrum claims has improved the way OpenStack commits changes to its many components upstream, and has been doing so quietly (which is a surprise, given its name) since 2012. It’s called Zuul, a project named for the “gatekeeper” ghoul that appropriates Sigourney Weaver’s body in the original “Ghostbusters.” (In this business, fun is where you make it.)
“We have a desire to test components that are integrated together,” Byrum told attendees. “So if it’s not a total island unto itself — which most things aren’t, and we want to make sure that they actually work together, not just that their APIs work the way they expect — we do a full integration test in every commit of every major project of OpenStack, where they’re integrated together pre-merge. We don’t land code unless it passes these tests.”
Zuul’s role in OpenStack is to provide a service called trunk gating (not to be confused with tailgating). This system puts new code submissions through the same set of paces each time: the code is tested locally in a virtual environment, then submitted to OpenStack’s Gerrit repository for review, and then updated with a series of automated patches. Once that code reaches the acceptance phase, it becomes subject to another battery of pre-merge checks. And once merged, it remains subject to post-merge analysis.
It’s a kind of superstructure made up of Jenkins pipelines. As Byrum admitted to the CNCF, it was not a piece of cake to construct.
Hedging at Hyperscale
“Jenkins architecture had a fatal flaw, which is that we wanted to gate. But in order to gate with Jenkins, it doesn’t have any sort of way to optimize that, other than submitting giant amounts of merge commits or monorepo,” Byrum said, referring to the alternative approach of housing multiple versions of interrelated code components in a single repository. He noted his team did investigate that alternative, but in the end declined to assume its burdens in place of the ones they had.
Maintaining independent repositories, he suggested, made it easier for his team to devise a trunk gating system. In their particular case, they created a unique system of ascertainment that makes science fiction seem antiquated: speculative future merge testing.
Byrum credited OpenStack core contributor James E. Blair with the creation of the algorithm behind Zuul’s testing engine, whose inspiration, in turn, Blair credited in 2013 to speculative multithreading techniques used by CPUs. The concept there came to fruition in about 1998 and was invented as a means for processors to expedite parallel processing, especially in situations where multiple threads shared the same data.
It’s literally the principle of jumping ahead. When a core has an opportunity, it executes what appear to be instructions at the tail of one thread. For the version Blair cited, if those speculative instructions intended to make changes to main memory, then the contents of that memory were first mirrored in a cache, so the speculative changes may be made there. If the thread ends up having been executed prematurely (if the speculation was wrong), then the cache contents are wiped and the previous thread state restored. That incurs a few cycles of penalty, but cumulatively, the total penalties should be less than the bottlenecks a CPU encounters when it tries to execute every thread in perfect sequence.
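The commit-or-discard idea in that paragraph can be sketched in a few lines of Python. This is only an illustration of the general principle, not how any real processor is built: the dictionary standing in for main memory, the function name run_speculatively, and the speculation_was_correct flag are all assumptions made up for the example.

```python
import copy


def run_speculatively(memory, speculative_writes, speculation_was_correct):
    """Apply writes to a shadow copy of memory, then commit or discard them.

    `memory` is a plain dict standing in for main memory (an assumption
    for illustration); `speculative_writes` is a list of (key, value) pairs.
    """
    shadow = copy.deepcopy(memory)       # mirror memory into a "cache"
    for key, value in speculative_writes:
        shadow[key] = value              # speculative writes touch only the copy
    if speculation_was_correct:
        return shadow                    # commit: the speculative state becomes real
    return memory                        # squash: drop the copy, keep the old state


memory = {"x": 1, "y": 2}
committed = run_speculatively(memory, [("x", 10)], speculation_was_correct=True)
squashed = run_speculatively(memory, [("x", 99)], speculation_was_correct=False)
```

In the first call the guess pays off and the new value of x survives; in the second the guess is wrong and the original contents are returned untouched, at the cost of the wasted work.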
Blair borrowed this idea as a way of allowing Gerrit to trigger merge testing to proceed for each component, even before the point in the pipelines when we know it’s actually necessary. In continuous integration, multiple components have to be tested in sequence, and the hierarchy of those tests is determined by those components’ dependencies upon one another. Components that pass all tests may be merged with the production stream.
But when a test for a prospective change does fail, Zuul’s scheduler responds by removing that change from the sequence of tests, while at the same time maintaining the change’s position in the merge queue. It doesn’t alter the queue; it merely relies upon the gatekeeper to ensure that untested changes are never merged. Subsequently, any tests in the sequence that relied upon the failed change are restarted.
All the while, the gatekeeper maintains an internal pointer called the nearest non-failing item (NNFI). This becomes the focal point for the testing sequence. But it also creates a launch point for a speculative outcome, as Byrum described it — a way for Zuul to jump ahead and work on a prospective landing order for new pull requests.
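To make the nearest non-failing item idea concrete, here is a minimal Python sketch. It is not Zuul’s code; the function names, the list used as a queue, and the set of failed changes are hypothetical. It only shows how, once a change is known to have failed, the changes behind it can be re-based on the closest change ahead that has not failed.

```python
def nearest_non_failing(queue, failed, index):
    """Return the closest change ahead of queue[index] that has not failed.

    Returns None when nothing ahead qualifies, meaning the change would be
    tested directly on top of the current branch tip.
    """
    for ahead in reversed(queue[:index]):
        if ahead not in failed:
            return ahead
    return None


def speculative_base(queue, failed, index):
    """List the changes assumed to have merged when testing queue[index]."""
    return [change for change in queue[:index] if change not in failed]


queue = ["A", "B", "C", "D"]
failed = {"B"}                                      # B's tests are known to have failed

nnfi_for_c = nearest_non_failing(queue, failed, 2)  # "A": C skips over B
nnfi_for_d = nearest_non_failing(queue, failed, 3)  # "C"
base_for_d = speculative_base(queue, failed, 3)     # ["A", "C"]
```

Under these assumptions, C and D would have their tests restarted against a future that no longer contains B, while A is unaffected because it never depended on B landing.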
“Essentially, what [Zuul] does is look at all the events it has been told about, including approvals of [pull requests], and it says, ‘I’m going to build a future which has them landing in this order.’ If they’re not expressed dependencies of each other, if they’re not in the same repo and stacked on each other, then it makes sure they can all merge together, and builds a big, long pipeline of Change 1, then Change 1 + 2, then Change 1 + 2 + 3, and so on. Then it tests each combination in parallel using an elastic cloud.”
As a result, he told the CNCF, for a large number of simultaneously converging pull requests (say, 25), Zuul creates a window of possibilities for merge order as narrow as one. Using NNFI, Zuul can reorganize this sequence and execute a gate reset should any of the merge commits in the sequence cause a problem.
“In the best case, no matter how much work you have backed up and approved to land, you take one testing window,” said Byrum. “In the worst case, you get all the patches to know that all of them failed, but you’ve landed no bad code.”
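A toy simulation can illustrate those best and worst cases. It assumes a deliberately simplified, round-based model that is not taken from Zuul itself: every queued change is tested in parallel within one window, everything ahead of the first failure lands, the failing change is rejected, and the rest are retested in the next window. The names run_gate and tests_pass are hypothetical.

```python
def run_gate(queue, tests_pass):
    """Simulate a simplified speculative gate over an ordered queue.

    `tests_pass(state)` is a hypothetical callback returning True when the
    given list of changes passes integration tests together.
    Returns (merged, rejected, windows).
    """
    merged, rejected, windows = [], [], 0
    pending = list(queue)
    while pending:
        windows += 1
        # One window: test every cumulative future in (conceptual) parallel.
        results = [tests_pass(merged + pending[: i + 1])
                   for i in range(len(pending))]
        if all(results):
            merged.extend(pending)           # everything lands together
            pending = []
            continue
        first_bad = results.index(False)
        merged.extend(pending[:first_bad])   # changes ahead of the failure land
        rejected.append(pending[first_bad])  # the failing change never merges
        pending = pending[first_bad + 1:]    # the rest are retested without it
    return merged, rejected, windows


changes = [f"change-{n}" for n in range(1, 26)]          # 25 approved changes
best = run_gate(changes, lambda state: True)             # nothing fails
worst = run_gate(changes, lambda state: False)           # everything fails
bad = {"change-2"}
mixed = run_gate(changes[:4], lambda state: bad.isdisjoint(state))
```

In this model the best case lands all 25 changes in a single window and the worst case lands nothing at all, matching the guarantee that no bad code merges. The window counts are only illustrative, since a real gate need not wait for a window to finish before restarting tests.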
There Is No Dana
Of the several virtues credited to OpenStack over the years, a steady and relentless pace of evolution has not been one of them. Byrum acknowledged that other open source community projects have experimented with version 2.x of Zuul, with poor results. In the same spirit of humility, he admitted that Zuul’s evolution up to the current version has diverted off its main course here and there, to account for the specific and exclusive needs of OpenStack, thus exposing some of that platform’s implementation details, perhaps unnecessarily.
But a “long, large-scale refactoring” of Zuul has been in the works for years, he told the CNCF, culminating in what he hopes will be version 3.x of Zuul by June at the latest. That new version will include a more agnostic definition language, and he advised that the CNCF hold off on adopting Zuul until the dependability of the API based on that language can be guaranteed.
In the meantime, however, he suggested that developers interested in experimenting with gatekeeping try out BonnyCI, which he described as “Zuul-as-a-Service.” As a cloud-based environment, BonnyCI may not yet be ready for the scale of a CNCF-hosted project; for example, he said, its dependencies outside of its own cloud may be limited to GitHub. However, it could conceivably demonstrate the gatekeeping and speculative future principles in action.
The topic of Zuul’s portability speaks to the broader issue of whether algorithms can be trusted to expedite automation on a macro scale, the same way they expedite code execution on a micro scale. Can we think of the process of cloud-based software development using registers, fetches, pre-fetches, and caches as though we lived and worked in a massive CPU? The cultural aspects may seem daunting, but the prospect for productive rewards is undeniable. (There, and you thought I’d end this piece with another Ghostbusters reference, like a warning about crossing the streams or something.)
The Cloud Native Computing Foundation is a sponsor of The New Stack.
Title image of a gatekeeper gargoyle outside Magdalen College in Oxford, by Chris Creagh, licensed under Creative Commons.