Primer: Kubeflow Streamlines Machine Learning with Kubernetes

This article is a post in a series on bringing continuous integration and deployment (CI/CD) practices to machine learning. Check back to The New Stack for future installments.
Kubeflow was created to make it easier to develop, deploy and manage machine learning applications. It’s a composable, scalable, portable machine learning stack built on Kubernetes, and it grew out of the way Google was running TensorFlow on Kubernetes.
Kubeflow was first released as an open source project in December 2017. It integrates commonly used machine learning tools like TensorFlow and Jupyter Notebooks into a single platform. Kubeflow’s innovation is twofold: it gives existing machine learning tools a better way to work together as a cohesive pipeline, and it automates more of the machine learning application lifecycle, including data logistics and version management, than was previously possible.
David Aronchick, head of open source machine learning strategy at Azure, one of the Kubeflow co-founders and an early Kubernetes project manager, said he was inspired to start Kubeflow after hearing dozens of data scientists describe the spiderweb of applications they cobbled together to get their work done. It reminded him of pre-Kubernetes software development.
“Kubeflow is designed to address the full machine learning application development lifecycle, from model development to deployment and modifications over time,” says Michelle Casbon, senior engineer at Google and active Kubeflow contributor. “This process is fundamentally different from commonly found web or mobile architectures.”
“Kubeflow was built with one specific goal, and that was to simplify machine learning workflows,” explains Jim Scott, director of enterprise architecture at MapR. “If you’re not doing machine learning, don’t look at Kubeflow, it’s the wrong tool for the job.”
Even as a relatively new tool, Kubeflow use has increased, including among enterprise users. According to Scott, that’s because there simply aren’t any other great options for managing ML/AI workflows. Casbon agrees, adding that most machine learning deployments she’s seen have either used highly customized closed-source platforms or been one-off implementations. With Kubeflow, data scientists can access a level of workflow automation similar to what software engineers already use. The hope, Casbon says, is that it will help build industry-standard best practices for managing machine learning pipelines.
Managing Complexity
Before Kubeflow, many data scientists used workflow tools that were designed for software engineering, which left them trying to cobble together the parts that didn’t fit. A typical machine learning workflow might include the following steps: data ingestion, data analysis, data transformation, data validation, model building, model training, model validation, training at scale, serving the model and monitoring. There are often dozens or hundreds of iterations as the models are adjusted and new data is incorporated.
“Complexity quickly grows when multiple models are involved, and maintaining this type of system is unwieldy,” Casbon explained. “Identifying and fixing errors is less straightforward (than in software engineering) and models are difficult to keep track of.”
“It’s been a longstanding problem that these different pieces in the lifecycle of machine learning or deep learning projects had to be done manually,” explained Karthic Rao, machine learning consultant and Kubeflow contributor. “There hasn’t been a way to automate or chain up these processes end-to-end.”
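To make the idea of chaining stages end-to-end concrete, here is a minimal Python sketch. It does not use Kubeflow or its APIs; the stage names (`ingest`, `transform`, `validate`, `train`), the toy data and the `run_pipeline` helper are all hypothetical stand-ins for a few of the workflow steps listed above. It shows only the general pattern: each stage consumes the output of the previous one, so nothing has to be handed off manually.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes the shared pipeline state and returns an updated copy.
Stage = Callable[[Dict], Dict]


def ingest(state: Dict) -> Dict:
    # Hypothetical data source: a tiny in-memory dataset.
    return {**state, "rows": [1.0, 2.0, 3.0, 4.0]}


def transform(state: Dict) -> Dict:
    # Scale every value into the range [0, 1].
    peak = max(state["rows"])
    return {**state, "rows": [r / peak for r in state["rows"]]}


def validate(state: Dict) -> Dict:
    # Stop the pipeline early if the transformed data looks wrong.
    if not all(0.0 <= r <= 1.0 for r in state["rows"]):
        raise ValueError("transformed data out of range")
    return state


def train(state: Dict) -> Dict:
    # Stand-in "model": just the mean of the training data.
    rows = state["rows"]
    return {**state, "model": sum(rows) / len(rows)}


def run_pipeline(stages: List[Tuple[str, Stage]]) -> Dict:
    """Run the stages in order, recording which ones completed."""
    state: Dict = {"completed": []}
    for name, stage in stages:
        state = stage(state)
        state = {**state, "completed": state["completed"] + [name]}
    return state


PIPELINE = [
    ("ingest", ingest),
    ("transform", transform),
    ("validate", validate),
    ("train", train),
]

result = run_pipeline(PIPELINE)
print(result["completed"], result["model"])
```

Because the stage order is declared once in `PIPELINE`, rerunning the whole chain after a data or model change is a single call rather than a series of manual steps.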
Especially in the context of enterprise machine learning applications, there are other challenges. Working with multiple teams on machine learning applications requires that everyone use the same library versions, which is often hard to ensure without a tool managing updates. In addition, once you’ve defined your machine learning stack, you might need to update parts of it or modify the stack.
“In a corporate setting that can be very challenging,” explains Carmine Rimi, AI/ML product manager at Canonical. “You could spend more time keeping your stack up-to-date than doing your work.”
Many data scientists struggle just to keep track of data versions and manage the dozens of model iterations, which becomes even more difficult when more applications and features are involved.
Lastly, data management and logistics can be a challenge. Most machine learning applications use terabytes of data both for training and in production, and that data is constantly changing.
“Kubeflow provides a platform for solving many of these issues,” says Casbon. The composable nature of Kubeflow makes it easy to define your stack and standardize it across the company. Kubeflow can also automatically manage updates, so the stack is always updated and everyone is using the same versions.
Kubeflow also helps manage hyperparameter tuning, giving data scientists a tool to keep track of the iterations they go through to develop, test and train a model so that they know which set of variables produced the most useful results.
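As a rough illustration of what keeping track of tuning iterations involves, the Python sketch below runs a small grid search and records every trial so the best set of variables can be looked up afterward. It is not Kubeflow code: the `evaluate` scoring function, the parameter names and the grid values are hypothetical, and a real run would train and validate a model instead of computing a formula.

```python
from itertools import product
from typing import Dict, List


def evaluate(learning_rate: float, batch_size: int) -> float:
    # Hypothetical score standing in for real training plus validation.
    # It peaks at learning_rate=0.1 and batch_size=64.
    return -(learning_rate - 0.1) ** 2 - ((batch_size - 64) / 1000) ** 2


def run_trials(grid: Dict[str, list]) -> List[Dict]:
    """Try every combination in the grid and keep a record of each trial."""
    names = sorted(grid)
    trials = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        trials.append({"params": params, "score": evaluate(**params)})
    return trials


GRID = {
    "learning_rate": [0.01, 0.1, 1.0],
    "batch_size": [32, 64, 128],
}

trials = run_trials(GRID)
best = max(trials, key=lambda trial: trial["score"])
print(len(trials), best["params"])
```

Keeping the full `trials` list, rather than only the winner, is what lets a team later answer which combinations were already tried and how they compared.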
Perhaps most importantly, Kubeflow uses automation to stitch together the different stages in a machine learning application’s lifecycle, reducing the number of steps that have to be done manually. That both frees up time and reduces the chance of errors.
Why Kubernetes
Kubeflow is based on Kubernetes and requires a Kubernetes environment, although it can be any Kubernetes environment, whether plain vanilla or Google Kubernetes Engine. This offers several advantages in a machine learning context. Perhaps most obviously, as Kubernetes becomes the industry-standard container orchestration platform, being able to run machine learning applications in the same environment as the rest of the company’s applications reduces IT complexity. It also makes it much easier to develop a model locally, using a laptop, before pushing the application to a production Kubernetes environment.
There are also specific ways that Kubeflow leverages Kubernetes features. Autoscaling is extremely important to machine learning models because training can be very resource-intensive while serving the model is generally less so. Kubernetes already manages autoscaling; Kubeflow just puts that feature into the machine learning context. Kubernetes also manages the deployment and orchestration of the containers in the machine learning application.
“If it weren’t for Kubernetes, Kubeflow wouldn’t exist,” Scott says. “Other arbitrary tools would be getting used.”
Kubeflow is barely a year old, and it’s already being used to manage enterprise-level machine learning projects. Google uses Kubeflow for internal projects, and MapR uses Kubeflow both in its machine learning products and in custom applications built for clients. “I’d like to see Kubeflow become like what Kubernetes is today,” Aronchick says. Most importantly, though, he hopes Kubeflow will reduce some of the barriers to building successful machine learning applications and expand practical applications of machine learning in business, healthcare and other industries.
For more information on Kubeflow, listen to this TNS interview with Aronchick as well as Mesosphere product marketing manager Chris Gaun and Mesosphere’s technical lead for community projects Jörg Schad about Kubeflow:
KubeFlow: Manage AI Workflows With Kubernetes
The Cloud Native Computing Foundation, which manages Kubernetes, is a sponsor of The New Stack.
Feature image by Paweł Czerwiński on Unsplash.