IBM, Red Hat Bring Load-Aware Resource Management to Kubernetes

Engineers at IBM Research and Red Hat OpenShift recently teamed up to tackle a problem that has long plagued the Kubernetes space: how to manage resources more efficiently in a way that takes existing loads and usage patterns into account. While Kubernetes is known for its ability to scale to handle large traffic increases, it does so by scaling out horizontally, adding more pods and nodes, rather than vertically. In addition, the Kubernetes scheduler places application pods with no regard for the actual current utilization of nodes.
Alaa Youssef, a research manager for Container Cloud Platform at IBM T.J. Watson Research Center, explained that Kubernetes, as it is, is “a really good bean counter,” but fails to take into account the current and historic utilization that could make it operate more efficiently.
“We are trying to add more intelligence to the scheduling and to the scaling,” he said. “We have to start by extending the Kubernetes framework. One thing we enabled in Kubernetes’ scheduling framework is to be able to use the actual utilization information of the nodes in the scheduling decision. This is something, believe it or not, that was not available in Kubernetes until today. Anybody can come now and do even smarter algorithms that leverage this node utilization information when making better scheduling decisions.”
To address this problem, the teams decided to build solutions on upstream Kubernetes that would approach it from two different directions, both of which they recently released as open source projects.
First, the Trimaran load-aware scheduler plug-ins work to “factor in the actual usage on the worker nodes — something Kubernetes doesn’t take into account,” as IBM writes in a blog post. With this usage in mind, Trimaran places the application on the node that would best provide efficient cluster utilization, avoiding nodes with high load fluctuations that may impact the application performance.
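As a concrete illustration, the Trimaran plug-ins (published in the Kubernetes scheduler-plugins project) are wired in through a scheduler configuration. The sketch below assumes the TargetLoadPacking plug-in and a metrics-watcher endpoint; exact field names and API versions may differ between releases:

```yaml
# Sketch of a KubeSchedulerConfiguration enabling a Trimaran
# load-aware plug-in. Plug-in and argument names follow the
# kubernetes-sigs scheduler-plugins project and may vary by release.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: trimaran-scheduler
    plugins:
      score:
        enabled:
          - name: TargetLoadPacking    # score nodes by actual utilization
    pluginConfig:
      - name: TargetLoadPacking
        args:
          targetUtilization: 70        # try to pack nodes up to ~70% CPU
          watcherAddress: http://load-watcher:2020  # metrics aggregator
```

Pods that set `schedulerName: trimaran-scheduler` would then be placed using live node utilization rather than requests alone.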
Next, the Vertical Pod Autoscaler (VPA) is an open source controller, now available in Red Hat OpenShift 4.8, that allows developers to automatically resize their containers. VPA manages scale by reviewing both historic and current CPU and memory usage for the containers in a pod, then updating resource limits and requests based on those values.
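For context, a VPA object in upstream Kubernetes typically looks like the following sketch; the workload name here is hypothetical:

```yaml
# Minimal VerticalPodAutoscaler: the controller observes the target
# Deployment's containers and rewrites their CPU/memory requests
# based on observed usage.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # hypothetical workload name
  updatePolicy:
    updateMode: "Auto"    # apply recommended requests automatically
```

With `updateMode: "Auto"`, the controller can evict and recreate pods to apply its recommendations; a mode such as `"Off"` instead surfaces recommendations without acting on them.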
With VPA, Youssef said that developers could be less precise about both their current and future needs, in terms of necessary resources for a container. “You can rely on the vertical autoscaler to detect that you need more resources and start asking Kubernetes to give you more resources on the fly,” explained Youssef.
In essence, the two solutions approach the same problem from opposite sides: Trimaran tries to optimize performance up front, scheduling the load according to current and historic usage, while VPA does so after the fact, vertically scaling the container to the right size based on the application's observed resource use.
Priya Nagpurkar, director of cloud platform research at IBM Research, explained that Trimaran and VPA are just the first steps, with Trimaran, in particular, serving as a potential foundation for future innovation. “Kubernetes gives you a lot of the knobs you need for scheduling and placement, but there are still a lot of unexploited opportunities in how to drive those knobs intelligently. The team here has also been involved in architecting Kubernetes, in a way that the right knobs are present,” said Nagpurkar. “Kubernetes, as it started out, had some mechanisms to schedule and so on, but now, with what we have contributed, you can drive it with more intelligence and more algorithms.”
Already, PayPal has taken advantage of Trimaran in production, building Load Watcher, a cluster-wide metrics aggregator, for use with Trimaran.
With Trimaran and VPA in place, both Nagpurkar and Youssef said they see a future where AI and machine learning can be used to further improve efficiency and performance.
“The goal here is to take a less operational and more application-centric view of the cloud platform and make it more adaptive and more automated,” said Nagpurkar. “I think in the future, this will just be transparently used on the platform without the user necessarily having to do a lot, but I think that’s the longer-term vision. The intelligent platform will just be seamlessly integrated with those tools and framework.”