
Intel’s Plan to Bring Deterministic Performance to Complex Server Workloads

9 May 2016 10:49am, by

The following story is the second in a two-part series exploring how Intel is helping to improve deterministic performance of complex systems, ranging from container-based cloud workloads to real-time telecommunications operations. Read the first part here.

It was March 31, 2016. In the basement of a San Francisco facility leased to tech companies (often the ones without their own offices) to host lunches and give presentations for their clients, partners, and customers, Intel had assembled a demonstration session for some of its customers, and a few members of the press interested enough to care what Intel would be talking about.

[Photo: This way to Cloud Day]

For some, it was too beautiful a day to be seated in a basement on wooden folding chairs acquired from a fire sale. But for those who showed up, the subject was important enough. The topic was determinism, and why Intel would endow an even greater number of SKUs of its mainline Xeon server processor — now the E5 v4 — with a mechanism called cache allocation that promised to bring deterministic performance to even high-level workload orchestration.

Charged with the task of explaining perhaps the most esoteric element of processor design ever conceived to this mixed bag was a principal engineer in Intel’s Network Platforms Group named Edwin Verplanke.

“If you look at the communications diagrams, you might see this traffic that is growing at an astronomical rate,” said Verplanke. “But actually, the investments that comms suppliers and cloud suppliers need to make to keep up with that kind of data traffic are tremendous… It basically means that the investments they have to make, to keep up with that data demand, are bigger than the revenues they return from the data that goes over the networks.”

[Photo: Edwin Verplanke]

In effect, the cost, measured in real dollars, of adding each new customer’s data traffic to a service provider’s network is more than the cost of adding the previous customer’s traffic. The reason is latency. In multi-tenant server clusters, new workloads tend to interfere with existing workloads. And when any customer’s workloads scale out, the interference generated just from their multiplicity affects everyone.

Granted, Verplanke was speaking about service providers, which quantitatively represent a very small percentage of the total market for workload orchestration, in much the same way that New York City represents a small percentage of the U.S. population.

Reverberation

To be cost-effective, the hardware infrastructure that hosts workloads must use a single processor architecture capable of hosting all workload classes, including those of communications service providers, cloud service providers, and enterprises. But because each workload class makes different demands on that same architecture, the hardware must be adaptable. Virtualization cannot solve the problem in and of itself, because different workload classes behave differently, regardless of how they are being virtualized.

To this end, Intel had released its Resource Director Technology (RDT), which includes the cache allocation technique. Its goal is to enable orchestration tools such as Kubernetes and its commercial offshoot Tectonic (whose manufacturer, CoreOS, has officially partnered with Intel) to make adjustments as necessary to how the processor handles workloads, without forcing software developers to change those workloads.

Intel Xeon processors with the highest number of cores, Verplanke noted, could include as much as 40 MB of last-level cache (LLC, or in Xeon processors, L3) — the shared zone of memory fetched from slower DRAM to faster SRAM. Several years ago, Intel made it possible for every core in its processor to have equal access to the L3, which makes perfect sense if your goal is to run a single multi-threaded application on one processor.

But even though modern server processors contain as many as 20 cores, parallelism in the real world, said Verplanke, limits a VM to being exposed to only four cores. This fact practically guarantees that multiple VMs will contend for that LLC on any one server, however their workloads may be orchestrated. In a worst-case scenario, his team calculated, a single application’s performance in a consolidated deployment may degrade by as much as 51 percent compared with its performance when running by itself.

[Photo: Edwin Verplanke]

In the case of a compression algorithm used in benchmarking, called SPEC CPU2006 Bzip2, when 13 instances of the application containing the algorithm shared the server, occasionally and without warning, one instance could become as much as five times slower than another.
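The two figures quoted here, a 51 percent degradation and a fivefold gap between instances, are different measurements. A minimal Python sketch of how each might be computed follows. The runtimes are invented purely for illustration and are not Intel's data, and the function names are hypothetical.

```python
def slowdown_spread(runtimes_s):
    """Ratio of the slowest instance's runtime to the fastest instance's."""
    if not runtimes_s or min(runtimes_s) <= 0:
        raise ValueError("need at least one positive runtime")
    return max(runtimes_s) / min(runtimes_s)


def degradation_pct(solo_runtime_s, shared_runtime_s):
    """Percent of performance lost versus running alone.

    Performance is taken as work per second, i.e. 1 / runtime.
    """
    if solo_runtime_s <= 0 or shared_runtime_s <= 0:
        raise ValueError("runtimes must be positive")
    return (1.0 - solo_runtime_s / shared_runtime_s) * 100.0


# Invented runtimes (seconds) for 13 co-located instances; NOT measured data.
runtimes = [100, 104, 98, 110, 101, 99, 490, 103, 97, 105, 102, 100, 108]

spread = slowdown_spread(runtimes)  # slowest instance vs. fastest instance
loss = degradation_pct(solo_runtime_s=100, shared_runtime_s=200)

print(f"spread: {spread:.2f}x, degradation: {loss:.0f}%")
```

The spread compares instances running side by side, while the degradation compares one application against its own solo baseline, so a server can score badly on one and acceptably on the other.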

“It’s pretty obvious, as you start consolidating a large number of workloads,” he said, “you really need an architecture that deals with that.”

Over-fetching

To prove his point, Verplanke and fellow Intel research scientist Andrew Herdrich cited the results of tests conducted jointly with orchestration tools maker AppFormix. Some of these same tests were later re-created by AppFormix engineers for The New Stack. As we could see for ourselves, these tests verified that replicated containers can be, and probably already are, the worst offenders in generating latency.

[Screenshot from AppFormix's WebEx meeting, 4/4/2016]

As the upper left graph in the quadrant shows, when AppFormix engineers spin up multiple copies of the same container on the same server, the L3 cache usage spikes, in step with the drag on the CPU shown in the lower right corner.

It’s a problem that absolutely cannot be solved by the developers of the software being replicated in containers. No matter how efficiently a containerized workload utilizes resources and plays nicely with the processor on its own, the latency and non-determinism problems begin when that workload is duplicated.

[Photo: Andrew Herdrich]

Herdrich cited another example: video transcoding algorithms, which motion picture editors use at very large scale. Theoretically, he suggested, monitoring tools could detect when these algorithms tend to drive up total system utilization. It’s these spikes in utilization that foretell when contention is about to happen. Such tools could react, he said, by scheduling other jobs with higher priorities instead, on those nodes where contention appears to be rising. This could happen even without Intel’s Resource Director Technology, which would tag specific threads to monitor their contention levels, since a scheduling system should already be aware of when and how a single job causes utilization to spike.
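One reading of Herdrich's suggestion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any Intel or AppFormix tool: the node names, job names, window size, and threshold are all hypothetical, and "rising contention" is approximated as a jump in average utilization between two adjacent windows of samples.

```python
from statistics import mean


def is_rising(samples, window=3, threshold=0.10):
    """True if the mean of the last `window` samples exceeds the mean of the
    preceding `window` samples by more than `threshold` (utilization is a
    fraction between 0.0 and 1.0)."""
    if len(samples) < 2 * window:
        return False  # not enough history to judge a trend
    recent = mean(samples[-window:])
    earlier = mean(samples[-2 * window:-window])
    return recent - earlier > threshold


def contended_nodes(utilization_by_node, window=3, threshold=0.10):
    """Names of nodes whose utilization appears to be spiking, sorted."""
    return sorted(
        node
        for node, samples in utilization_by_node.items()
        if is_rising(samples, window, threshold)
    )


def next_job(pending, node_is_contended):
    """Pick the next job for a node.

    `pending` is a list of (priority, name) tuples in arrival order; a larger
    number means a higher priority. On a contended node the highest-priority
    job goes first; otherwise jobs run in arrival order.
    """
    if not pending:
        return None
    if node_is_contended:
        return max(pending, key=lambda job: job[0])
    return pending[0]


# Hypothetical utilization history per node (fraction of total capacity).
history = {
    "node-a": [0.40, 0.42, 0.41, 0.60, 0.72, 0.85],
    "node-b": [0.50, 0.52, 0.51, 0.50, 0.49, 0.51],
}
pending = [(1, "batch-report"), (5, "packet-forwarder"), (2, "transcode")]

flagged = contended_nodes(history)
print(flagged, next_job(pending, node_is_contended="node-a" in flagged))
```

A scheduler built this way only sees total utilization. It cannot tell which thread is responsible for the spike, which is the gap the thread tagging in Resource Director is meant to fill.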

Verplanke then presented his own lab’s test results, which verified his team’s observation of nearly two years earlier: that parts of the underlying Linux kernel, when replicated, misbehaved on the last-level cache.

“The other thing we found was, as a result of bringing up a large number of containers, you actually tend to over-fetch,” Verplanke continued. “So while containers become a disruptive force for each other, the other thing that happens is, you don’t have efficient use of the LLC as a result of the container fetching the same kind of infrastructure — [for example], if you’re calling a certain library that’s replicated over and over for different containers.”

The whole point of a memory cache in the first place is to enable pre-fetching — to copy a sizable block of address space into faster memory, in hopes that this block will be addressed several times before the next miss occurs. Determining which blocks to pre-fetch is something of an art form. Thinking logically, it’s easy to conclude that a multitude of different containers sharing one server through virtualization could force the LLC to pre-fetch vastly distributed blocks of address space far too often.

What monitoring tools such as the kind employed by AppFormix tell us, though, is that many copies of the same container can cause potentially worse problems by forcing the LLC to fetch, and re-fetch, and re-fetch again, the same blocks of address space. “By isolating the [amount] of the LLC that the containers get to see, we actually run more efficient,” said Verplanke. “If you pre-fetch too much, you’re not making efficient use of the last-level cache.”

Resource Direction

There are any number of very complex ways to explain how Intel Resource Director seeks to resolve this issue. Since we’re pressed for space and time, here’s a simpler explanation that just happens to make more sense:

In a perfect world, a processor manufacturer would be able to design, fabricate, and produce one set of chips for hosting containers and other virtualized workloads, and a completely separate set for high-availability supercomputing workloads. To keep costs low, Intel does not have that option. Had engineers foreseen the needs of containerized workloads six years earlier (Docker itself is less than four years old), they might have started the wheels turning for a class of server processor whose last-level cache is partitioned by design, with one partition delegated to each core.

Without such a hard partition, Intel needs a way to build temporary chain-link fences, if you will — barriers essentially made of software.
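Intel's public documentation describes cache allocation in terms of capacity bitmasks: each class of service is given a mask of contiguous bits, and each bit stands for a slice of the last-level cache. The Python sketch below shows how such non-overlapping "fences" might be computed. The 20-bit mask width and the group names are assumptions made for illustration, and the code only calculates masks; it does not program any hardware.

```python
def partition_masks(total_ways, shares):
    """Split `total_ways` cache slices into contiguous, non-overlapping bitmasks.

    `shares` maps a group name to the number of slices it should receive.
    Groups are laid out from the least significant bit upward, in the order
    given. Slices left unassigned stay outside every mask.
    """
    if any(ways < 1 for ways in shares.values()):
        raise ValueError("every group needs at least one slice")
    if sum(shares.values()) > total_ways:
        raise ValueError("shares exceed the available cache slices")

    masks = {}
    offset = 0
    for group, ways in shares.items():
        # `ways` consecutive 1-bits, shifted past the groups already placed.
        masks[group] = ((1 << ways) - 1) << offset
        offset += ways
    return masks


# Assumed 20-bit mask; hypothetical split between two workload groups.
masks = partition_masks(
    total_ways=20,
    shares={"latency_sensitive": 12, "replicated_containers": 8},
)

for group, mask in masks.items():
    print(f"{group:>22}: {mask:#07x}")
```

Because the fences are only bit patterns, an orchestrator could redraw them whenever the mix of workloads changes, which is what separates this approach from a cache that is hard-partitioned in silicon.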

Now that containerization is both a reality and an industry, Intel is appealing to the leaders of that industry with Resource Director. Yet in unveiling Resource Director, from its digs at the bottom of the computing stack, Intel has shed light on a fact so large and so pervasive that those at the top of the stack have failed to notice it: This latency issue, where one copy of a process can run five times slower than another on the same processor, has been in front of our faces since Docker first thought a whale would make a cute logo. If we’re supposed to feel that steady performance improvement … we’re not.

Scale has been distorting our perception. From a developer’s perspective, the benefits of replicating processes and sticking load balancers in front of them appeared to outweigh the costs. But at high scale, those costs could be measured in dollars. It took Intel, at the infrastructure level, at the segment of the stack we weren’t supposed to be concerned with, to tell us we were doing things wrong, a bit like Levi’s being the first to catch us with our pants down.

Luckily for us, that message came with an olive branch, in the form of Resource Director.

But before we congratulate ourselves once again for having dodged yet another bullet, we should heed the real message of Intel’s discovery, the one Intel is too polite to say out loud. We’ve been relying on Moore’s Law as if it were an automated service, or the force of gravity, or the ozone layer. It has not yet sunk in that we may need a plan for when it’s gone.

Intel is a sponsor of The New Stack.

Photos by Scott Fulton III.
