How Intel is Working to Improve Deterministic Performance for Complex Workloads

6 May 2016 8:26am, by

The following story is the first in a two-part series exploring how Intel is helping to improve deterministic performance of complex systems, ranging from container-based cloud workloads to real-time telecommunications operations. The second part will run on Monday.

Layers of abstraction carefully partition each level of the computing stack (including the operating system, the network, and the Internet). They were devised intentionally so that a developer on one level would not need to be concerned about the operating specifications of all the other levels. What has made Web development work is its independence from browsers, operating systems, and even the Internet itself; what has made virtualization work (up to now, at least) is the independence and isolation VMs and containers have maintained from the operating systems that host them.

These partitions help developers and engineers at each level focus on the tasks in front of them. But they also shield these people from the reality system builders now face: What happens at the highest levels of the stack sends ripple effects down to the very bottom, affecting the processor itself.

These ripples are especially felt by communications providers, such as telcos and cable companies, whose standards for security and availability are far higher than those of the typical enterprise data center, even one at the scale of Google or Facebook.

“One of the key requirements within the comms environment is real-time performance determinism,” explained Edwin Verplanke, a principal engineer in Intel’s Network Platforms Group. “If you have a packet stream coming in, you don’t want to drop any packets, because the problem is, if you have a security appliance that drops packets, it’s not quite secure.”

Telcos and broadband content providers require determinism, perhaps more so than any other single customer category. Determinism is, at its simplest, the guarantee that a process will work within a performance envelope.

Because telco workloads are orchestrated in real time and their traffic is processed in real time, they need each execution cycle to perform the same as the previous one. Drop a single frame of data during even a minor lull in performance, and channel encryption is no longer viable. Encryption is a necessity not only for security but also as a way to produce virtual channels in a system that flushes billions of packets downstream, out of order.

Pachinko

CPUs are brilliant marble machines. For the simplest ones ever produced, their logic could conceivably be mimicked with ball bearings, gears, and levers. It’s wrong to say Intel CPUs are enormously complex — not if you understand how chip logic works. They’re just huge, in a very compressed way. The many ways processors expedite their processes can be spellbinding to watch when simulated. But they can also become unpredictable.

This is the problem Verplanke and his fellow Intel engineers faced. When the logic unit or core fetches data, it doesn’t reach directly into the address space of DRAM, but instead into one of about three caches. Here, the contents of contiguous regions of DRAM space are temporarily duplicated. The biggest cache in a CPU is the last-level cache, which in Intel Xeon processors is the L3. Unlike the other caches, L3 is designed to be shared among all the cores in the CPU. This means several threads will hit this cache at the same time.

Since a core’s next memory fetch operation is more likely to target the spot in memory adjacent to the previous one, the L3 cache can supply the core with data much faster than DRAM can, but only until the pointer runs off the edge of the cache. That generates a “miss,” which triggers a new operation that refreshes the cache. Such refreshes consume time. A process is said to be more deterministic when it generates fewer cache misses.
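The hit-and-miss behavior described above can be sketched with a toy model. This is only an illustration, not how a Xeon's L3 is actually organized: the `ToyCache` class, the four-line capacity, the 64-byte line size, and the least-recently-used eviction policy are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (an assumption for this sketch)


class ToyCache:
    """A fully associative LRU cache of whole lines; a toy model only."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line number -> None, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[line] = None  # the "refresh": load the line from memory
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used
        return False


def run(addresses, num_lines=4):
    """Replay a sequence of addresses and return (hits, misses)."""
    cache = ToyCache(num_lines)
    for address in addresses:
        cache.access(address)
    return cache.hits, cache.misses


# Adjacent bytes: 256 reads touch only 4 lines, so most reads hit.
sequential = run(range(256))
# One read per line across 256 different lines: every read misses.
scattered = run(range(0, 256 * LINE_SIZE, LINE_SIZE))
print(sequential, scattered)
```

In this model, the sequential pattern pays for a handful of refreshes and then runs from the cache, while the scattered pattern pays for a refresh on every read.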

When a cache “belongs” to a core, as with the L1, even cache misses can be accounted for, and determinism can be preserved. But the L3 is a shared playground among the various threads (in a hyperthreaded system, there may be two threads to each core), and those threads don’t always play nicely with one another.

Telcos — one of Intel’s most lucrative customer groups — were reporting a sharp downturn in the performance of deterministic processes, particularly when they were being multi-threaded.

When Edwin Verplanke and his colleagues sought out the culprit, they didn’t have to look far to find it. It was the Linux kernel.

“Our monitoring feature allows us to establish which applications are behaving nicely and which ones aren’t behaving as nice,” he told journalists at a recent press event. “And one of the things that we found is that operating systems — if you don’t give them any work to do, they’re by default very, very noisy. And we saw lots of differences, going from Linux kernel to Linux kernel, testing this feature.”

Not every Linux distribution was careful with how it scheduled process threads, his team discovered, especially during what were supposed to be idle periods. Once the kernels got busy, their ability to play nicely with the L3 cache increased, and determinism improved as a result. But kernels spend a lot of time doing idle work, such as “garbage collection,” in between busy periods.

Point of Contention

Before his retirement last July, Billy Cox was a software-level engineer, having spent seven years at Intel and the previous 37 years at HP. As Cox told journalists that day in 2014, although telcos such as Chunghwa Telecom first brought the non-determinism issue to Intel’s attention, he reasoned that the very same issue may already be plaguing everyday, enterprise workload orchestration.

When workloads are deployed at greater scales in ever-increasing capacities of cloud staging space, they get cloned. Container orchestrators spawn hundreds, perhaps thousands, of replicates. The same process, cloned amongst a handful of threads, generated the greatest latency and became its own worst enemy.

To address this issue, Cox’s team effectively simulated the problem by intentionally producing software routines that dragged down L3 cache performance, for what Intel called “noisy neighbor” tests. Those routines were encapsulated in virtual machines, some of which were programmed to be “aggressors,” and others “victims.”

“My team defined a concept called contention,” he told journalists. “How much contention is happening on a given socket? We largely measured around cache occupancy and the number of misses.” While Verplanke’s team would measure latency on each particular thread, Cox’s team would measure the overall, perceptible performance impact of contention.

Cox’s thinking worked like this: If CPUs were capable of monitoring and reporting their contention scores in real time, then workload orchestrators could use these scores to determine which server nodes would be optimal, or least noisy, for handling each new process. Contention, when expressed as a raw value attainable through a simple API call, would give operators at the top of the stack some interest in (and perhaps, in turn, responsibility for) what’s going on at the bottom of the stack.
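As a rough sketch of that idea, an orchestrator could rank candidate nodes by a contention score and place new work on the quietest one. Nothing here reflects an actual Intel or orchestrator API: the `NodeReading` type, the `contention_score` formula (an even blend of cache occupancy and miss rate), and the sample readings are all hypothetical, since the article does not say how the score was computed.

```python
from dataclasses import dataclass


@dataclass
class NodeReading:
    name: str
    cache_occupancy: float  # fraction of L3 in use, 0.0 to 1.0 (hypothetical)
    miss_rate: float        # fraction of L3 lookups that miss, 0.0 to 1.0


def contention_score(reading, occupancy_weight=0.5):
    """Hypothetical score: a weighted blend of occupancy and misses."""
    miss_weight = 1.0 - occupancy_weight
    return (occupancy_weight * reading.cache_occupancy
            + miss_weight * reading.miss_rate)


def least_contended(readings):
    """Pick the node whose reported contention is lowest."""
    return min(readings, key=contention_score)


# Made-up sample readings, for illustration only.
readings = [
    NodeReading("node-a", cache_occupancy=0.90, miss_rate=0.40),
    NodeReading("node-b", cache_occupancy=0.30, miss_rate=0.10),
    NodeReading("node-c", cache_occupancy=0.60, miss_rate=0.05),
]
best = least_contended(readings)
print(best.name)
```

The point of the single number is that a scheduler needs no knowledge of cache internals to use it; it only compares values across nodes.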

At the opposite end of the stack from Intel were the IT managers conducting workloads using suites such as VMware vRealize Operations Manager, and also an emerging class of developers using open source schedulers such as Apache Mesos.

Workloads impact determinism. Without determinism, the most important real-time workloads cannot be efficiently orchestrated through VMs or containers at even a moderate scale. And the more layers of abstraction are put in place, the more difficult it becomes to isolate the specific causes and apply remedies in a way that can be efficiently automated.

Cox’s idea would involve creating “least common denominators” — quantities of measurement that demonstrably affect the work done by people on both sides, and that could be of interest to them as well. With that, Intel could perhaps leverage the assistance of software developers in clearing the next set of performance hurdles that hardware engineers could no longer clear by themselves.

Play Nicely, Now

Intel’s existing stake in virtualization was its VT technology — code embedded in the processor that enabled hypervisors to go beneath the operating system, much lower in the stack, to the processor level for the resources they needed to host virtual machines. That wasn’t deep enough, though — not for this mission. The L3 cache sits at the heart of the CPU.

So first, Intel would create a technology called cache monitoring — a way for developers higher up in the stack to arbitrarily tag a thread belonging to a process. This way, for the first time, when performance was perceived to be lagging, developers could at least get a handle on the possible perpetrators.
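To illustrate what tagging makes possible, the toy model below records which tag loaded each line of a shared cache, then reports how many lines each tag currently holds. It is a software simulation under stated assumptions, not Intel's mechanism: the `MonitoredCache` class, the 16-line capacity, the LRU policy, and the "quiet" and "noisy" tags are all hypothetical.

```python
from collections import Counter, OrderedDict


class MonitoredCache:
    """Toy shared LRU cache that remembers which tag loaded each line."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.owner = OrderedDict()  # line -> tag of the thread that loaded it

    def access(self, tag, line):
        if line in self.owner:
            self.owner.move_to_end(line)  # hit: mark as most recently used
            return
        self.owner[line] = tag  # miss: load the line on behalf of this tag
        if len(self.owner) > self.num_lines:
            self.owner.popitem(last=False)  # evict the least recently used

    def occupancy(self):
        """Lines currently held in the cache, counted per tag."""
        return Counter(self.owner.values())


cache = MonitoredCache(num_lines=16)
for step in range(64):
    cache.access("quiet", ("quiet", step % 2))  # keeps reusing 2 lines
    cache.access("noisy", ("noisy", step))      # touches a new line each step
report = cache.occupancy()
print(report)
```

With per-tag occupancy in hand, the thread crowding the cache stands out immediately, which is the "handle on the possible perpetrators" the monitoring feature was meant to provide.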

[Image: slide from Intel's Xeon E5-2600 v3 Press Workshop Comms Update presentation]

Next, Intel would devise cache allocation, a way to partition chunks of the cache for designated workloads. Partitioning would ensure that noisy neighbors could not interfere with one another — more technically, that a cache miss generated by one thread wouldn’t diminish the performance, and therefore the level of determinism, of other threads.
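The effect of partitioning can be sketched with a small simulation of an "aggressor" and a "victim" sharing a cache. This is a toy model, not Intel's implementation: the `LruCache` class, the eight-line capacity, the even four-and-four split, and the access patterns are assumptions picked to make the contrast visible.

```python
from collections import OrderedDict


class LruCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def access(self, line):
        """Return True on a hit, False on a miss (which loads the line)."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        self.lines[line] = None
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)
        return False


def victim_misses(victim_cache, aggressor_cache, rounds=100):
    """Count victim misses; pass the same cache twice to model sharing."""
    misses = 0
    next_aggressor_line = 0
    for round_number in range(rounds):
        # Victim: loops over a small working set of 4 lines.
        if not victim_cache.access(("victim", round_number % 4)):
            misses += 1
        # Aggressor: streams through 8 brand-new lines every round.
        for _ in range(8):
            aggressor_cache.access(("aggressor", next_aggressor_line))
            next_aggressor_line += 1
    return misses


# Shared: both threads compete for the same 8 lines.
shared = LruCache(8)
shared_misses = victim_misses(shared, shared)

# Partitioned: each thread gets its own 4 lines.
partitioned_misses = victim_misses(LruCache(4), LruCache(4))
print(shared_misses, partitioned_misses)
```

In the shared case the aggressor evicts the victim's lines before they can be reused, so the victim misses on every access; with its own partition, the victim misses only while first loading its working set.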

Verplanke presented data that detailed how a single process performing an ordinary task could easily generate the same degree of latency typically associated with a denial-of-service attack.

[Image: slide from Intel's Xeon E5-2600 v3 Press Workshop Comms Update presentation]

“If you don’t have the ability to partition off your cache,” he told journalists, “you’d have a lot of evictions, your interrupt handler would be scheduled out, and you would incur a quite significant increase in latency.” His graph, shown here, depicted that latency surge in red. Eliminating the interplay between processes vying for the same L3 cache eliminated the surge entirely.

To boil it down to the meat of the matter: The test involved a single process, cloned. It was the first hard evidence that the noisiest neighbors in a Xeon neighborhood were clones of the same process.

And that’s the problem, because cloning is precisely what container orchestrators do when they deploy workloads at scale.

The very act of orchestrating workloads at scale was defeating the processor’s ability to host workloads at scale.

Discovering the cause would point the way towards an eventual solution: one which made its way to general availability in March 2016. In Part 2 of this story, to be posted Monday, we’ll see the results of Intel’s latest trials of this solution, and discuss the roadblocks that remain towards bringing software developers and hardware engineers together to implement it.

Intel is a sponsor of The New Stack.

Photos by Scott Fulton III.
