Kubeflow 1.0 Brings a Production-Ready Machine Learning Toolset to Kubernetes

For developers looking to parallelize and otherwise streamline their machine learning (ML) workloads on Kubernetes, the open source project Kubeflow reached version 1.0 this week. The now production-ready project offers “a core set of stable applications needed to develop, build, train, and deploy models on Kubernetes efficiently.”
The project was first open sourced in December 2017 at KubeCon+CloudNativeCon and has since grown to hundreds of contributors from more than 30 participating organizations such as Google, Cisco, IBM, Microsoft, Red Hat, Amazon Web Services and Alibaba. Alongside the blog post from the Kubeflow team itself, Google has offered a post on how Kubeflow works with Anthos, while IBM’s Animesh Singh explores the “highlights of the work where we collaborated with the Kubeflow community leading toward an enterprise-grade Kubeflow 1.0.”
In an interview with The New Stack, Singh explained that Kubeflow began as an attempt simply to bring TensorFlow to Kubernetes.
“The core idea behind the project when it germinated was to ensure that we can run TensorFlow in a first-class manner on top of Kubernetes,” said Singh. “That means being able to distribute training across multiple Kubernetes containers and then provide a ready-made solution where someone, if they want to run distributed training using TensorFlow and Kubernetes, they can get going easily.”
Singh explains that the basic problem boils down to the amount of data and a lack of processing power to train models. With Kubernetes, these workloads can be parallelized more easily, cutting training times that he says have recently grown to “days, if not weeks.”
“The amount of data has increased, so the training times have gone up, and they, in general, are very computationally intensive — not something you can possibly handle on a laptop,” said Singh.
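To make the idea of distributing training across containers concrete, here is a minimal sketch of how one training process can learn its place in the cluster. TensorFlow's distributed training reads a TF_CONFIG environment variable containing a JSON description of the cluster and of the current task, and Kubeflow's TFJob operator is designed to populate it for each pod. The helper name describe_task and the example addresses are assumptions made for illustration, not part of Kubeflow or TensorFlow.

```python
import json
import os


def describe_task(tf_config_json: str) -> dict:
    """Summarize this process's role from a TF_CONFIG-style JSON string."""
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "role": task.get("type"),      # e.g. "worker", "ps" or "chief"
        "index": task.get("index"),    # position within that role
        "num_workers": len(cluster.get("worker", [])),
    }


# Hypothetical two-worker cluster; the host names are made up.
EXAMPLE_TF_CONFIG = json.dumps({
    "cluster": {"worker": ["trainer-worker-0:2222", "trainer-worker-1:2222"]},
    "task": {"type": "worker", "index": 1},
})

# Inside a real training pod TF_CONFIG would already be set;
# the example value is only a fallback for running this locally.
summary = describe_task(os.environ.get("TF_CONFIG", EXAMPLE_TF_CONFIG))
print(summary)
```

Because every replica runs the same container image, this environment variable is what lets identical code take on different roles in the same training job.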
Kubeflow, however, goes beyond just training models, instead tackling the whole machine learning process with a series of tools.
Kubeflow 1.0 allows users to develop models in Jupyter, use a Kubeflow tool like fairing (Kubeflow’s Python SDK) to build containers and create Kubernetes resources to train their models, and finally use KFServing (built on top of Knative) to create and deploy a server for inference.
Currently, Kubeflow comes with a number of “graduated” applications, which include the Kubeflow UI, the Jupyter notebook controller and web app, a TensorFlow Operator (TFJob) and a PyTorch Operator for distributed training, kfctl for deployment and upgrades, and a profile controller and UI for multiuser management. Additionally, the project team explains that several more applications are currently in beta, which they plan to graduate to version 1.0 in a future release. Those applications include pipelines for defining complex ML workflows, metadata for tracking datasets, jobs, and models, Katib for hyper-parameter tuning, and some additional distributed operators for other frameworks like XGBoost.
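As a rough illustration of what the TensorFlow Operator consumes, the sketch below builds a TFJob resource as a plain Python dictionary and prints it as JSON. The kubeflow.org/v1 API group, the tfReplicaSpecs field and the container name "tensorflow" follow the TFJob resource as commonly documented; the make_tfjob helper, the job name and the image reference are assumptions invented for this example.

```python
import json


def make_tfjob(name: str, image: str, workers: int) -> dict:
    """Build a minimal TFJob manifest with a single Worker replica group."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            # The operator looks for a container named "tensorflow".
                            "containers": [
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


# Hypothetical job name and image; substitute a real training image.
manifest = make_tfjob("mnist-train", "example.com/mnist-trainer:latest", workers=3)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could in principle be saved to a file and applied to a cluster that has the operator installed.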
The Kubeflow team writes that getting started with Kubeflow is as easy as a single command, and it offers pre-built manifests for Google Cloud Platform, AWS, IBM, Google Anthos, and more. According to Singh, the “core goal is not to tie it to any vendor” and Kubeflow should work on any Kubernetes cluster, as long as you are using supported versions. Right now, that means Kubernetes 1.15 and lower, and Istio 1.3.1 and lower.
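For readers who want to check a cluster against that Kubernetes ceiling before installing, a version comparison could look something like the sketch below. It only compares version strings and does not talk to a cluster. The function names and the MAX_KUBERNETES constant are assumptions for this example, with the 1.15 limit taken from the figure quoted above.

```python
MAX_KUBERNETES = (1, 15)  # ceiling quoted in the article for Kubeflow 1.0


def parse_version(text: str) -> tuple:
    """Turn 'v1.15.3' or '1.15.9-gke.8' into a tuple of ints like (1, 15, 3)."""
    # Drop a leading "v" and any vendor suffix after "-" or "+".
    core = text.lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def is_supported(version: str, ceiling: tuple = MAX_KUBERNETES) -> bool:
    """Return True if the version's major.minor does not exceed the ceiling."""
    return parse_version(version)[:len(ceiling)] <= ceiling


for candidate in ("v1.14.10", "v1.15.9-gke.8", "1.16.0"):
    verdict = "supported" if is_supported(candidate) else "too new"
    print(candidate, verdict)
```

Comparing only the major and minor components means any 1.15 patch release passes, which matches the "1.15 and lower" wording.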
Amazon Web Services, Red Hat and KubeCon+CloudNativeCon are sponsors of The New Stack.