
A Tip from Mechanical Engineering: Use Control Theory to Better Auto-Scale Systems

24 Apr 2018 1:53pm, by

Want to more efficiently automate the scaling up and down of your IT systems? Take a close look at the traditional practice of control theory, advised Allan Espinosa, a DevOps engineer for Bloomberg, and author of “Docker High Performance,” who spoke last week in a session at the Cloud Foundry Summit in Boston.

Though widely used in mechanical engineering for centuries, control theory has yet to be deployed much to automatically adjust the performance of distributed IT systems. This is surprising, given how much scalable services, from Kubernetes to serverless, rely on the ability to match available resources to workload demand as quickly and accurately as possible.

Engineers who might be tempted to delve into the murky arts of machine learning for improvements might be delighted to find out how easy control theory is to use instead. Espinosa noted that these algorithms could be applied to any function that can be auto-scaled: not only load balancing but also, say, choosing the ideal interval for automatic timeouts or estimating the amount of memory that should be allocated to each worker connection.

Control Freak

In short, control theory is “the mathematical study of how to manipulate the parameters affecting the behavior of a system to produce the desired or optimal outcome,” to borrow a definition from Wolfram Alpha. “Control theory allows us to have an architecture of a self-regulating system, which operates on feedback,” Espinosa said.

Espinosa showed an early example of an application for control theory: a centrifugal governor, developed by James Watt in 1788, that regulated the speed of steam engines. As the engine sped up, a set of spinning balls swung outward under centrifugal force, and that movement was, in turn, used to adjust the throttle control of the engine.

“With a linear model, you can model the relationship between the input and the output,” he said. IT systems aren’t entirely linear — there is always some noise or random variance in the system behavior. But a linear model can go a long way toward characterizing the state of a system.
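As a rough illustration of what such a linear model might look like, the sketch below fits a straight line to a handful of measurements using ordinary least squares. The data points, the variable names and the choice of requests per second as the input are all hypothetical; the talk, as reported here, did not specify how the model was fitted.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, slightly noisy measurements (not from the talk):
# input = requests per second, output = CPU utilization in percent.
requests_per_sec = [100, 200, 300, 400, 500]
cpu_percent = [12.0, 21.5, 32.5, 41.0, 52.0]

slope, intercept = fit_line(requests_per_sec, cpu_percent)

# Use the fitted line to estimate CPU utilization at an unseen load.
predicted_at_600 = slope * 600 + intercept
print(f"cpu ~= {slope:.4f} * rps + {intercept:.2f}")
print(f"estimated CPU at 600 rps: {predicted_at_600:.1f}%")
```

The measurements do not fall exactly on a line, which mirrors the point about noise: the fit still gives a usable estimate of how the output responds to the input.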

For this conference, Espinosa showed how his control theory equation could be used to improve the performance of Scale, Cloud Foundry’s native auto-scale component. Espinosa’s algorithm models how fast or how slow the CPU activity changes, and balances that against recent performance.

The algorithm calculates how quickly the workload for a CPU is rising or falling. As expected, if the workload rises above a certain threshold, more CPUs can be added. If the workload dips, then CPUs are taken offline. Other factors that can be taken into account for a control theory model include how far from the target the system is at any given point (“the accuracy”), as well as the time taken to get to the ideal level (“recovery time”).
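The article does not give Espinosa's actual equation. A common textbook way to combine these ingredients is a PID (proportional-integral-derivative) controller, sketched below as a stand-in: the proportional term reflects how far the system is from the target, the integral term reflects accumulated recent history, and the derivative term reflects how quickly the measurement is changing. The class name, gains and the 60 percent CPU target are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook PID controller; not the algorithm presented in the talk."""
    kp: float          # proportional gain: reacts to the current error
    ki: float          # integral gain: reacts to accumulated past error
    kd: float          # derivative gain: reacts to how fast the error changes
    setpoint: float    # target value, e.g. desired CPU utilization in percent
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, measurement: float, dt: float = 1.0) -> float:
        """Return a correction; positive means 'add capacity'."""
        error = measurement - self.setpoint
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history yet on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains and target, picked only to make the arithmetic easy.
pid = PIDController(kp=0.1, ki=0.02, kd=0.05, setpoint=60.0)
first = pid.update(80.0)   # CPU at 80%: 20 points over target
second = pid.update(90.0)  # CPU still climbing: the derivative term kicks in
print(first, second)
```

In this sketch the second correction is larger than the first, both because the error has grown and because it is growing quickly. In practice the gains would have to be tuned against the real system.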

Espinosa demonstrated this principle with a Node.js application running on Cloud Foundry, along with a simulated load generator running on Kubernetes. When he quadrupled the traffic, the Cloud Foundry controller responded by adding more CPUs. The response was much smoother than the steep oscillations in resources that typically accompany traditional auto-scaling algorithms.
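To get a feel for this kind of closed-loop behavior, here is a toy simulation. It is not a reproduction of the demo. It assumes a made-up system in which every instance can serve 100 requests per second at full utilization, traffic has just quadrupled from 250 to 1,000 requests per second, and a simple incremental feedback rule nudges the desired instance count in proportion to the utilization error on every tick. All numbers and names are hypothetical.

```python
TARGET_UTILIZATION = 60.0   # percent; assumed target
CAPACITY_PER_INSTANCE = 100  # requests/sec at 100% utilization; assumed
GAIN = 0.05                  # instances added per percentage point of error


def utilization(load: float, instances: int) -> float:
    """Average utilization (percent) if load is spread evenly."""
    return 100.0 * load / (instances * CAPACITY_PER_INSTANCE)


def simulate(load: float, start_instances: int, ticks: int) -> list:
    """Run the feedback loop and return the instance count at each tick."""
    desired = float(start_instances)  # controller state, kept fractional
    history = []
    for _ in range(ticks):
        running = max(1, round(desired))       # only whole instances exist
        error = utilization(load, running) - TARGET_UTILIZATION
        desired += GAIN * error                # feedback: correct toward target
        history.append(running)
    return history


# Traffic has just quadrupled to 1,000 rps while 4 instances are running.
history = simulate(load=1000.0, start_instances=4, ticks=40)
print(history)
```

Under these assumptions the instance count climbs in shrinking steps rather than lurching back and forth, then hovers between two adjacent counts because only whole instances can be run. A real controller would likely add a dead band or smoothing to suppress that last bit of flapping.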

One company investigating the use of control theory for cloud-native systems has been Pivotal. Company researchers are currently looking at how to add elements of control theory to the company’s feedback controller for its Riff serverless “function-as-a-service” (FaaS) software. The current method of auto-adjusting, like many implementations, is more ad hoc in nature.

After the talk, Pivotal senior software engineer Jacques Chester noted that auto-scaling FaaS from a cold start can be somewhat tricky to code for: you want zero delay for the first instance, but more traditional control measures for subsequent instances. Riff itself was built around streaming as a primary use case, so traditional metrics such as CPU usage aren’t really helpful. Instead, Riff relies on observing activity and offsets from consumers and producers, Chester explained.
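One way to picture offset-based scaling is to treat the gap between the producer's latest offset and the consumer's committed offset as a backlog, and size the number of function instances from that backlog. The sketch below is only an illustration of that idea, not Riff's implementation: the function name, the per-replica backlog figure and the replica cap are all hypothetical.

```python
import math


def desired_replicas(producer_offset: int,
                     consumer_offset: int,
                     max_lag_per_replica: int = 100,
                     max_replicas: int = 10) -> int:
    """Size a function's replica count from its unprocessed backlog.

    Hypothetical policy: scale to zero when idle, start one replica as
    soon as any message is waiting, then add one replica per
    `max_lag_per_replica` messages of lag, up to `max_replicas`.
    """
    lag = max(0, producer_offset - consumer_offset)
    if lag == 0:
        return 0  # nothing waiting: no instances needed
    needed = math.ceil(lag / max_lag_per_replica)
    return min(max_replicas, max(1, needed))


print(desired_replicas(500, 500))   # idle
print(desired_replicas(501, 500))   # first message arrives after a cold start
print(desired_replicas(1000, 500))  # moderate backlog
print(desired_replicas(5000, 0))    # large backlog, capped
```

The jump from zero to one replica is handled as a special case here, which echoes the point that the first instance needs to appear without delay while later instances can follow a more gradual rule.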

Cloud Foundry and Pivotal are sponsors of The New Stack.
