Choose the Right Storage Engine for Kubeflow and ML Workloads

Kubeflow is a unique workload designed for Kubernetes. The platform abstracts the underpinnings of Kubernetes by exposing a set of integrated capabilities to data scientists, developers, machine learning engineers, and operators. It is also unique because of the prerequisites it imposes for running a robust, cloud native, and enterprise-ready machine learning platform.
Like any other mature application designed for Kubernetes, Kubeflow relies heavily on the storage layer to achieve high availability and deliver the expected performance.
There are many open source and commercially available storage engines for Kubernetes that can be used with Kubeflow. From Ceph/Rook to Red Hat's GlusterFS to good old NFS, customers can choose from a variety of options. But no single storage layer meets all the requirements of the Kubeflow platform and its diverse set of components, such as Notebook Servers, Pipelines, and KFServing.
When you use Kubeflow, you are expected to meet the storage requirements of the platform and the ML jobs that you run through Jupyter Notebooks, Pipelines, Katib, and KFServing. It’s important to know that the Kubeflow platform and the ML jobs have distinct storage requirements.
Let's take a closer look at the storage configuration of these two layers: the Kubeflow platform itself and the custom jobs that users run on it.
Storage Prerequisites for the Kubeflow Platform
Kubeflow is a comprehensive stack assembled from a variety of open source components and projects. The platform is based on Argo Workflows, Istio, JupyterHub, Knative, MinIO, MySQL, and Seldon.
There are multiple operators, CRDs, and Kubernetes objects that integrate these diverse open source projects to deliver the platform capabilities. For example, the tf-job-operator, pytorch-operator, and mxnet-operator are combinations of custom resources and operators that run distributed training jobs.
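To illustrate how these CRDs are consumed, here is a minimal sketch of a TFJob custom resource handled by tf-job-operator. The job name, namespace, container image, and training script are placeholders for this example, not artifacts of the Kubeflow installation itself.

```yaml
# Illustrative TFJob custom resource; the operator creates and manages the worker pods it describes.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-distributed        # placeholder name
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                # run two distributed training workers
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/mnist-train:latest   # placeholder training image
              command: ["python", "/opt/train.py"]    # placeholder training script
```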
Below is a subset of CRDs and operators created by Kubeflow:
Kubeflow’s CRDs and operators depend on some of the stateful services deployed as Kubernetes statefulsets and deployments with external PVCs.
Kubeflow needs a storage class that supports dynamic provisioning to create the PVCs on the fly.
Stateful services such as MySQL and MinIO need a persistent volume (PV) and persistent volume claim (PVC) backed by a high throughput storage layer.
When you run kubectl get pv immediately after installing Kubeflow, you see the persistent volumes created for MySQL and MinIO. The PVs are bound to the PVCs attached to the pods running within the kubeflow namespace.
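To verify this on your own cluster, list the volumes and then the claims in the kubeflow namespace:

```
kubectl get pv
kubectl get pvc -n kubeflow
```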
These PVCs are utilized by the pods highlighted in the screenshot below.
To ensure that the stateful services get the expected throughput and I/O, they need a high-performance storage layer. Another important aspect is that these stateful services are not configured as statefulsets; they are regular deployments backed by standard PVCs.
If you configure Kubeflow with shared filesystems such as NFS and GlusterFS, you may not get the expected throughput.
The key takeaway is that the Kubeflow platform layer needs a highly available, performant, and reliable storage engine that can deliver the throughput and I/O performance that write-intensive workloads such as MySQL and MinIO demand.
Storage Requirements for Machine Learning Jobs Running on Kubeflow
Now, let's take a look at a typical use case of Kubeflow: multiple teams within an organization leveraging the Notebook Server to build and deploy a deep learning model.
It all starts with the DevOps team building the container images for the individual teams: data scientists, ML engineers, and developers. The data science team prepares the data and performs feature engineering. The final dataset is stored in a shared location that is accessible to the ML engineers training and tuning the model. The trained model is persisted to another shared location used by the developers building the model serving and inference application.
The screenshot below shows how the DevOps team has created three Notebook Servers, one for each team.
The Notebook Servers are created under a dedicated Kubernetes namespace. In this example, they are part of the mldemo namespace. Notice how each Notebook Server is translated into an instance of a statefulset.
The pods dataprep-0, train-0, and infer-0 are associated with the respective Notebook Servers running in Kubeflow.
Each Notebook Server instance has a dedicated PVC in RWO (ReadWriteOnce) mode that becomes the home directory of the user. To enable sharing of artifacts such as datasets and models, each Notebook Server is also associated with a shared PVC in RWX (ReadWriteMany) mode that supports read and write operations from multiple pods.
To support this scenario, we create two shared PVCs beforehand and attach them to each Notebook Server at creation time.
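For illustration, the two shared claims might look like the sketch below. The names (shared-datasets and shared-models), the sizes, and the storage class portworx-shared-sc are assumptions made for this walkthrough rather than values taken from the environment above; the storage class itself is sketched later in this article.

```yaml
# Illustrative RWX claims for sharing datasets and models across Notebook Servers.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets          # placeholder name
  namespace: mldemo
spec:
  accessModes:
    - ReadWriteMany              # RWX so multiple Notebook Servers can mount it
  storageClassName: portworx-shared-sc
  resources:
    requests:
      storage: 20Gi              # placeholder size
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-models            # placeholder name
  namespace: mldemo
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: portworx-shared-sc
  resources:
    requests:
      storage: 10Gi              # placeholder size
```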
With this approach, DevOps can enable a shared environment for all the teams to collaborate effectively. Shared volumes are one of the critical requirements for Kubeflow applications and ML jobs.
Since the majority of the cloud native storage engines don’t deliver shared volumes out of the box, customers end up using NFS or GlusterFS for Kubeflow.
We will explore this concept further in the upcoming MLOps tutorial based on Notebook Servers and Kubeflow Pipelines.
Portworx By Pure Storage for Kubeflow
As we have seen, Kubeflow needs a combination of storage engines: a high-throughput, reliable backend for running the stateful components and a shared storage layer for the jobs running on Kubeflow.
Portworx by Pure Storage is a cloud native, container-granular, enterprise-grade storage engine for Kubernetes. It is a unique storage platform with capabilities such as replication, encryption, shared volumes, and built-in high availability and failover.
For Kubeflow, Portworx by Pure Storage becomes the natural choice due to the following reasons:
- Dynamic storage class optimized for running databases that need high availability and throughput
- In-built replication and high availability for regular stateful pods without configuring statefulsets
- Sharedv4 volumes provide the out-of-the-box capability to create multi-writer shared volumes
For stateful services such as MySQL, MinIO, and Jupyter Notebooks, the following Portworx storage class delivers the expected capabilities.
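A minimal sketch of such a storage class is shown below; the class name is hypothetical, and the provisioner and parameter values should be verified against the Portworx documentation for your release.

```yaml
# Illustrative default Portworx storage class for the Kubeflow platform services.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-db-sc           # hypothetical name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # make this the default class
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"                      # keep three replicas of every volume
  io_profile: "db"               # write-back flush coalescing profile for databases
  priority_io: "high"
allowVolumeExpansion: true
```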
Since the storage class is annotated as the default, any PVC that does not explicitly specify a storage class is provisioned from it automatically.
The repl parameter ensures that the data has at least three copies, which delivers high availability. The io_profile parameter enables a write-back flush coalescing algorithm that merges multiple sync operations issued within a 50 ms window, under the assumption that the replicas will not all fail simultaneously (through a kernel panic or power loss) within that window.
For provisioning shared volumes, we create a different storage class annotated as a sharedv4 volume.
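A sketch of such a storage class follows; the name matches the hypothetical portworx-shared-sc used in the PVC example earlier, and the parameters should again be checked against the Portworx documentation for your version.

```yaml
# Illustrative Portworx storage class for RWX (sharedv4) volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-shared-sc       # hypothetical name, referenced by the shared PVCs above
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"                      # replicate the shared data for availability
  sharedv4: "true"               # expose the volume as a shared, multi-writer volume
allowVolumeExpansion: true
```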
The PVCs based on the above storage class support RWX mode, making it possible to share ML artifacts across teams and Notebook Servers.
Portworx by Pure Storage is the only storage platform in the market that provides seamless support for both dedicated volumes (RWO) and shared volumes (RWX) without compromising performance or throughput.
Check back next Friday for the next part of this series, where I will walk you through the steps involved in integrating Portworx Essentials, the free container-native storage engine from Portworx by Pure Storage, with NVIDIA DeepOps. Stay tuned!