How Open Data Hub Speeds AI Development and Fixed a Kubernetes Bottleneck
Red Hat sponsored this post.
Many companies wonder how to properly apply AI/ML tools to their work and line-of-business applications. One good way to find such a use is to take a gander at what the rest of the business world is doing with machine learning, and then look for a place inside your organization where the same approach could be applied.
As an example, much has been made of the Netflix recommendation engine, which was the product of a public $1 million algorithm challenge. The resulting feature helps millions of Netflix subscribers navigate an enormous library of shows and movies. That increases user enjoyment of the platform, and helps the service become more useful to its users.
How can this recommendation engine model be applied to your business? How is it applied to our own business, here at Red Hat? We have some top minds working on these machine learning applications, and in the process we’ve discovered some interesting things about the open source projects that fuel the Red Hat OpenShift Platform.
Bottlenecked Upstream Open Source Project
What if, when building out a machine learning model for your company, you came up against a limitation inside the very infrastructure that you’re using? Don Chesworth, Red Hat principal data scientist and team lead, encountered just such a situation while working on machine learning models.
Chesworth has been building three machine learning models. His focus is on using these models to help Red Hat support its customers. He’s using PyTorch, GPUs, Kubernetes and Open Data Hub (a 100% open source machine learning platform) to build, deploy and run these models.
Of those three models, two are focused on analyzing the text of incoming support requests, and the third (currently in development) offers Netflix-like recommendations to customers: for example, “Customers who asked about this topic also found these five knowledge base articles useful.”
Unfortunately for Chesworth, fine-tuning that last model to improve its accuracy was extremely slow. It could require around 100 tests, and he calculated that each test would take about 73 hours to complete. That would never work for the planned use case, because it would take almost a year to determine whether the model was even accurate enough to provide to Red Hat’s customers.
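To see where the "almost a year" figure comes from, here is a quick back-of-the-envelope check in Python. It assumes the roughly 100 tests run one after another with no parallelism, which the article does not state explicitly, and the variable names are purely illustrative.

```python
# Back-of-the-envelope estimate using the figures quoted above.
# Assumption: the tests run sequentially, one at a time.
TESTS = 100            # approximate number of tuning tests
HOURS_PER_TEST = 73    # estimated duration of each test

total_hours = TESTS * HOURS_PER_TEST
total_days = total_hours / 24

print(f"{total_hours} hours is about {total_days:.0f} days")
```

Under that assumption the tuning effort comes to roughly ten months of wall-clock time.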
Red Hat has been planning for the AI/ML workloads of the future for some time. We have taken many of our learnings from these internal AI/ML initiatives and contributed them to an open source community project called Open Data Hub (ODH). Open Data Hub is a reference architecture based on open source community projects such as Apache Spark, PyTorch, TensorFlow, JupyterHub, Kubeflow, Apache Airflow and Seldon. Deployment of the various Open Data Hub components is fully automated with an Open Data Hub Kubernetes operator. ODH also includes complementary Red Hat products such as Red Hat Ceph Storage and Red Hat Decision Manager.
AI Library is another of our open source projects, this time initiated by the Red Hat AI Center of Excellence team. It’s an effort to provide ML models as a service on Red Hat OpenShift. The development of these models as services is a community-driven open source project, to make AI/ML models more accessible.
Chesworth knew that the shared memory in containers is very limited, and that PyTorch requires a lot of it to distribute data between multiple GPUs. When he asked the Open Data Hub team to increase the shared memory size, there was a problem: shared memory in Kubernetes containers wasn’t configurable.
While Chesworth works at Red Hat, he’s not a kernel developer, nor is he a Kubernetes contributor. Fortunately, many of those people exist at Red Hat. Although the Open Data Hub team was eventually able to find a solution, it wouldn’t be just a matter of patching his system and being done with it. Chesworth knew that this would also require an upstream contribution of polished code that would solve the problem for everyone, everywhere.
“By default,” said Chesworth, “in containers you have shared memory of 64MB. PyTorch, when you distribute it across multiple GPUs, uses that shared memory to swap data across systems. There was no easy way in Kubernetes and CRI-O to change that default.”
For background, CRI-O is a lightweight container runtime for Kubernetes.
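If you want to check how much shared memory your own container has, a minimal sketch along these lines may help. It assumes a Linux container where shared memory is exposed as a tmpfs mounted at /dev/shm. The function name shm_report and the constant DEFAULT_SHM_BYTES are made up for this example, and the 64MB figure is simply the default quoted above.

```python
import os
import shutil

# The 64MB container default quoted above (illustrative constant).
DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def shm_report(path="/dev/shm"):
    """Return the size of the filesystem mounted at `path`, in megabytes."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_mb": usage.total / (1024 * 1024),
        "free_mb": usage.free / (1024 * 1024),
        # True when the mount is no larger than the 64MB default.
        "at_or_below_default": usage.total <= DEFAULT_SHM_BYTES,
    }


# Only report when /dev/shm actually exists (it may not outside Linux).
if os.path.isdir("/dev/shm"):
    print(shm_report())
```

Run inside a container, a total of about 64MB would suggest the default is still in effect.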
A workaround to this issue had already been addressed in OpenShift 3.11, but the solution made things more difficult for developers. Plus it was not a generalized Kubernetes-based solution. Thus, the Red Hat teams got cracking at solving this problem in the upstream Kubernetes community, so that the entire world could benefit from the fix — not just Chesworth.
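The article does not spell out what the OpenShift 3.11 workaround looked like. Purely for illustration, one approach commonly used on Kubernetes is to mount a memory-backed emptyDir volume over /dev/shm. The sketch below builds such a pod manifest as a Python dictionary and prints it as JSON. The pod name, image and size are hypothetical placeholders, and this is not necessarily the workaround or the upstream fix described here.

```python
import json


def pod_with_larger_shm(name, image, shm_size="8Gi"):
    """Build a pod manifest that mounts a memory-backed volume at /dev/shm.

    `name`, `image` and `shm_size` are placeholders for this sketch.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                # Mount the memory-backed volume over the default /dev/shm.
                "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
            }],
            "volumes": [{
                "name": "dshm",
                # medium "Memory" asks Kubernetes for a tmpfs-backed volume.
                "emptyDir": {"medium": "Memory", "sizeLimit": shm_size},
            }],
        },
    }


manifest = pod_with_larger_shm("trainer", "example.com/trainer:latest")
print(json.dumps(manifest, indent=2))
```

Every workload that needs more shared memory has to carry this extra volume boilerplate, which hints at why a runtime-level fix is friendlier to developers.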
Members of the Open Data Hub team submitted a patch upstream to CRI-O, which was then included in the Kubernetes 1.20 release. As a result of the patch, Chesworth’s training job now only takes 49 minutes, compared to an estimated 73 hours before.
That means that Chesworth won’t have to wait almost a year to determine the accuracy of his model, but can do so within a month. Plus he can train several more models throughout the year, providing data-driven predictions across the teams that he serves.
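The size of the improvement can be checked with the same kind of rough arithmetic. This sketch uses only the figures quoted above (an estimated 73 hours before, 49 minutes after, around 100 tests) and again assumes the tests run back to back; the variable names are illustrative.

```python
# Rough speedup estimate from the figures quoted above.
# Assumption: tests run sequentially, one at a time.
MINUTES_BEFORE = 73 * 60   # estimated 73 hours per training job
MINUTES_AFTER = 49         # reported training time after the fix
TESTS = 100                # approximate number of tuning tests

speedup = MINUTES_BEFORE / MINUTES_AFTER
new_total_days = TESTS * MINUTES_AFTER / 60 / 24

print(f"speedup: about {speedup:.0f}x")
print(f"{TESTS} sequential tests: about {new_total_days:.1f} days")
```

By this estimate the full round of tests fits in a few days of compute, comfortably inside the month mentioned above.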
This allows Red Hat’s customer support teams to run the recommendation engine every single night, with as little as 42 minutes between new content being added to the support document database and that content becoming available through the customer support recommendation engine. That’s important when your product is consumed worldwide, 24/7. If 100 support requests suddenly come in from around the world and the newly found solution isn’t surfaced to those users quickly, that’s going to mean a lot of open support tickets, taxing staff and resources.
If you can lower the amount of time each support contact requires by offering up solutions before the problem can be escalated, everyone wins: customers are happy they didn’t have to dig into a support database, or search a forum. Support staff are happy that they get to focus on the customers with unique problems. And Chesworth is happy, because his machine learning algorithms are doing just what they were built to do: save everyone time.
As in most enterprise software development endeavors, the real resource at the end of every day is time, and there is never enough of it. Time between compilation, tests and deployment for developer feedback. Time training algorithms. Time updating databases. Time spent tracking a bug into Kubernetes or the Linux kernel. If you can shorten or eliminate these waits, you can save your team tremendous amounts of time overall.
Sponsor note: The latest version of Red Hat OpenShift has just been announced.