Kubeflow: Where Machine Learning Meets the Modern Infrastructure

The extensibility and scale offered by Kubernetes make it an ideal choice for building modern platforms. Kubeflow is an open source, specialized machine learning platform that takes advantage of Kubernetes capabilities to deliver end-to-end workflows to data scientists, ML engineers, and DevOps professionals.
This article series introduces Kubeflow and its capabilities to developers and operators. This first installment provides an overview of the platform, while the remaining parts cover the core building blocks, installation, configuration, and managing the lifecycle of machine learning models through end-to-end examples. By the end of this series, you will have a thorough understanding of Kubeflow and its capabilities.
The key advantage of using Kubeflow is that it hides the complexity involved in containerizing the code required for data preparation, training, tuning, and deploying machine learning models. A data scientist using Kubeflow is not expected to know Kubernetes concepts such as pods and statefulsets while training a model. It’s a true machine learning platform with its own UI, API, and command-line tools that abstract the underlying infrastructure based on Kubernetes and related technologies.
Enterprise ML platforms available in the public cloud, such as Amazon SageMaker, Azure ML, Google Cloud AI, and IBM Watson Studio, deliver end-to-end capabilities. Kubeflow is an excellent alternative to these services for customers considering an on-prem, open source ML platform. It comes close to the features and capabilities delivered by most of the commercial offerings, without the lock-in.
Overview of Kubeflow
Kubeflow is a platform for data scientists who want to build and experiment with ML pipelines. Kubeflow is also for ML engineers and operational teams who wish to deploy ML systems to various development, testing, and production-level serving environments.
There is a misconception that machine learning is only about mastering algorithms and the code to train models. According to the now-famous paper “Hidden Technical Debt in Machine Learning Systems,” presented at the Neural Information Processing Systems (NIPS) conference in 2015, only a small fraction of a real-world ML system is composed of actual ML code; the required surrounding infrastructure is vast and complex.
The process of operationalizing a machine learning model goes beyond the code written for model training. Deploying, scaling, and managing ML models in production is no different from managing a mission-critical application. It is a complex process that demands collaboration between data scientists, developers, ML engineers, and operators.
Kubeflow takes advantage of multiple cloud native technologies, including Istio, Knative, and Tekton. It leverages core Kubernetes primitives such as storage classes, deployments, services, and custom resources. With Istio and Knative, Kubeflow gains capabilities such as traffic splitting, blue/green deployments, canary releases, scale-to-zero, and autoscaling. Tekton brings the ability to build container images natively within the platform.
Since it is layered on top of Kubernetes, Kubeflow abstracts away these components to expose a coherent platform to ML developers and engineers.
Apart from Istio, Knative, and Tekton, Kubeflow heavily relies on various cloud native and open source projects, including Scikit-learn, TensorFlow, PyTorch, Apache MXNet, Argo Workflows, MinIO, Apache Spark, Prometheus, and Seldon Core. This integration delivers end-to-end capabilities, including data preparation, training, and serving.
Operationalizing a machine learning model involves multiple steps, from data acquisition to model monitoring. Kubeflow offers tools that map to each stage of the machine learning workflow. Behind the scenes, these tools are translated into Kubernetes resources such as pods, statefulsets, jobs, deployments, and services.
Google, one of the key contributors to the project, integrated some of the components of Kubeflow with Cloud AI, the company’s managed ML PaaS offering. This integration turned Kubeflow into a hybrid ML platform that spans the on-premises data center and the public cloud. Customers can train models on-premises and serve them in the public cloud, or vice versa.
Kubeflow is an excellent example of how to build a sophisticated platform on top of Kubernetes. It extends the promise of Kubernetes to deliver highly distributed, parallelized, at-scale machine learning model training and deployment.
Though Kubernetes lacks native support for multitenancy, Kubeflow makes it possible to isolate the environments used by individuals and teams. The shared, multitenant machine learning platform delivered by Kubeflow makes it an ideal candidate for enterprises.
Kubeflow Use Cases
Hosting datasets
With Kubeflow, large datasets can be stored centrally and shared with data scientists working on various projects. By leveraging Kubernetes persistent volumes and claims based on shared filesystems, Kubeflow makes it possible to share hosted datasets across projects.
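To make this concrete, here is a minimal sketch, using the official Kubernetes Python client, of how such a shared dataset volume might be provisioned. The namespace, storage class name, and capacity below are assumptions that depend entirely on your cluster; a ReadWriteMany-capable storage class (NFS or a similar shared filesystem) must already exist.

```python
# Sketch: provision a ReadWriteMany volume that multiple notebooks and jobs can mount.
# "nfs-client" is a hypothetical shared-filesystem storage class; adjust for your cluster.
from kubernetes import client, config

config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-datasets"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],   # required for sharing across pods and projects
        storage_class_name="nfs-client",  # hypothetical storage class backed by NFS
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="kubeflow", body=pvc
)
```

Any notebook server or training job that mounts this claim sees the same copy of the data, which avoids duplicating large datasets per project.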
Feature Engineering
Feast (Feature Store), an optional component of Kubeflow, is an operational data system for managing and serving machine learning features to models in production. Data scientists and ML engineers can use Feast to define, manage, discover, validate, and serve ML models’ features during training and inference.
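As an illustration, the sketch below declares a feature view with the Feast Python SDK. The entity, feature names, and Parquet path are hypothetical, and Feast’s APIs have changed across versions, so treat this as indicative rather than definitive.

```python
# Sketch: declaring a feature view with the Feast Python SDK (APIs vary by Feast version).
# The entity, feature names, and file path are hypothetical.
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    schema=[
        Field(name="avg_daily_trips", dtype=Int64),
        Field(name="conv_rate", dtype=Float32),
    ],
    source=FileSource(
        path="data/driver_stats.parquet",      # offline store source
        timestamp_field="event_timestamp",
    ),
)
```

At inference time, the same definitions can be served from the online store through `FeatureStore.get_online_features()`, which keeps the features used for training and serving consistent.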
Running Jupyter Notebooks
Kubeflow comes with a multitenant notebook server based on JupyterHub. Each notebook server can be based on a different container image customized for the project. For example, the team working on data preparation may launch a CPU-based notebook server, while ML engineers may launch a GPU-based notebook server for performing distributed training.
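Behind the scenes, each notebook server is a Kubeflow `Notebook` custom resource. The sketch below creates a GPU-backed server through the Kubernetes API; the namespace and container image are illustrative choices, not fixed names.

```python
# Sketch: launching a GPU-backed notebook server via Kubeflow's Notebook custom resource.
# The namespace and image are illustrative; pick an image that matches your project.
from kubernetes import client, config

config.load_kube_config()

notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "training-notebook", "namespace": "team-ml"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "training-notebook",
                    "image": "kubeflownotebookswg/jupyter-pytorch-cuda-full:latest",
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},  # request one GPU
                }]
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="team-ml",
    plural="notebooks", body=notebook,
)
```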
Distributed training of ML models
Kubeflow makes it easy to perform distributed machine learning jobs based on mainstream frameworks such as TensorFlow and PyTorch. It leverages the scheduler and custom controllers of Kubernetes to perform training at scale.
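Distributed training is driven by custom resources such as `TFJob` and `PyTorchJob`. Here is a minimal sketch of a `PyTorchJob` with one master and two workers; the training image and script path are hypothetical placeholders for your own training code.

```python
# Sketch: a two-worker PyTorchJob handled by Kubeflow's training operator.
# The image and script path are hypothetical; they should contain your training code.
from kubernetes import client, config

config.load_kube_config()

def replica(count):
    return {
        "replicas": count,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [{
            "name": "pytorch",                     # name required by the operator
            "image": "example.com/train:latest",   # hypothetical training image
            "command": ["python", "/workspace/train.py"],
        }]}},
    }

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "dist-train", "namespace": "team-ml"},
    "spec": {"pytorchReplicaSpecs": {"Master": replica(1), "Worker": replica(2)}},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="team-ml",
    plural="pytorchjobs", body=pytorch_job,
)
```

The operator wires up the distributed environment (ranks, master address) for each pod, so the training script only needs standard PyTorch distributed initialization.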
Creating repeatable experiments
Training sophisticated models based on deep learning and neural networks is similar to running a science experiment. Researchers and ML engineers experiment with a variety of parameters before arriving at a model with satisfactory results. Kubeflow Pipelines provides a consistent and repeatable experimentation environment.
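A pipeline is defined in Python with the KFP SDK and compiled into a portable definition that can be rerun with different parameters. The sketch below uses v2-style syntax; exact APIs differ between SDK versions, and the step shown is a placeholder for real training logic.

```python
# Sketch: a minimal Kubeflow pipeline (KFP SDK, v2-style syntax; APIs vary by version).
from kfp import dsl, compiler

@dsl.component
def train_model(learning_rate: float) -> str:
    # Placeholder for real training logic; each component runs in its own container.
    return f"trained with lr={learning_rate}"

@dsl.pipeline(name="repeatable-experiment")
def experiment_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

# Compile once; every run of the compiled pipeline repeats exactly the same steps.
compiler.Compiler().compile(experiment_pipeline, "experiment.yaml")
```

Because the compiled pipeline captures every step and its parameters, an experiment can be reproduced, compared, and shared across the team.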
Hyperparameter tuning
Katib is a Kubernetes-native project for automated machine learning (AutoML) that supports hyperparameter tuning, early stopping, and neural architecture search (NAS). With Katib integrated into Kubeflow, you can easily tune parameters such as the learning rate, the number of layers in a neural network, and the number of nodes in each layer. The neural architecture search feature goes further, searching for network designs that maximize a deep learning model’s predictive accuracy and performance.
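A tuning run is expressed as a Katib `Experiment` custom resource. The sketch below random-searches the learning rate; the metric name and namespace are illustrative, and the trial template (which defines the training job each trial runs) is elided for brevity.

```python
# Sketch: a Katib Experiment that random-searches the learning rate.
# Metric name and namespace are illustrative; the trial template is elided.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "lr-search", "namespace": "team-ml"},
    "spec": {
        "objective": {"type": "maximize", "objectiveMetricName": "accuracy"},
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [{
            "name": "learning_rate",
            "parameterType": "double",
            "feasibleSpace": {"min": "0.001", "max": "0.1"},
        }],
        # "trialTemplate": {...}  # defines the training job each trial runs
    },
}
```

Submitted through the Kubernetes API (as in the earlier examples), this causes Katib to spawn trials, each running a training job with a sampled learning rate and reporting the objective metric back.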
Model serving
One of the key advantages of Kubeflow is serving models for inference. Based on KFServing and Seldon Core, Kubeflow brings proven techniques for scaling and managing microservices to model serving. Both KFServing and Seldon Core support mainstream frameworks such as Scikit-learn, TensorFlow, PyTorch, and MXNet for model serving.
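With KFServing, deploying a trained model can be as simple as declaring an `InferenceService` resource that points at the model artifacts in object storage. The storage URI and namespace below are hypothetical.

```python
# Sketch: a KFServing InferenceService for a Scikit-learn model.
# The storage URI and namespace are hypothetical; the model is pulled from object storage.
inference_service = {
    "apiVersion": "serving.kubeflow.org/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "team-ml"},
    "spec": {
        "predictor": {
            "sklearn": {"storageUri": "s3://models/sklearn/iris"}  # hypothetical bucket
        }
    },
}
```

Because serving builds on Knative and Istio, the resulting endpoint can scale to zero when idle and support canary rollouts by splitting traffic between model revisions.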
Model monitoring
Kubeflow integrates with Prometheus for model monitoring. Operators and SREs can use the familiar Grafana dashboard to monitor the performance of deployed models.
In the next part of this series, we will install Kubeflow on a cluster with one or more GPU nodes. I will introduce the DeepOps toolkit from NVIDIA and demonstrate how it can be used to automate the installation of Kubeflow. Stay tuned.