
Tutorial: Configure Nvidia DeepOps to Use Portworx as Storage for Kubeflow

30 Apr 2021 10:25am

Nvidia DeepOps is a collection of scripts to configure Kubernetes and Kubeflow on CPU and GPU hosts. It comes with NFS as the default storage choice. This tutorial will demonstrate how to configure DeepOps to use Portworx by Pure Storage as the default storage engine for running the Kubeflow platform and the machine learning workloads.

The setup includes a hybrid collection of CPU and GPU hosts which will be a part of the Kubernetes cluster. One of the key requirements is that all the hosts must run the same Linux kernel. It’s a good idea to install Ubuntu 18.04 LTS server on all the hosts before starting the installation.

Step 1: Customizing DeepOps Ansible Playbook

Since Nvidia DeepOps relies on Kubespray and Red Hat Ansible, you need a bootstrap machine to run the playbooks. This can be an Ubuntu VM that has network access to the target hosts. I used an Ubuntu 18.04 VirtualBox VM running on my Mac as the bootstrap machine.

Make sure you can SSH into the hosts without a password, which means the SSH private key must be available on the bootstrap machine.

Start by cloning the DeepOps GitHub repository on the bootstrap/provisioning machine.

Switch to the most stable version of the installer.

Install the prerequisites and configure Ansible.
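The three steps above (clone, pick a release, run the setup script) can be sketched as one small shell helper. The function name `bootstrap_deepops` is hypothetical, and the release tag is deliberately left as an argument because the right tag depends on what is current when you install; `git tag --list` inside the clone shows the available releases. The repository URL and `scripts/setup.sh` are assumed from the public DeepOps repository.

```shell
# Hypothetical helper: clone DeepOps, check out a release tag, and run
# the setup script that installs the prerequisites and configures Ansible.
bootstrap_deepops() {
  # $1 = release tag to check out (assumption: pick one from `git tag --list`)
  [ -n "$1" ] || { echo "usage: bootstrap_deepops <release-tag>" >&2; return 1; }
  git clone https://github.com/NVIDIA/deepops.git &&
    cd deepops &&
    git checkout "tags/$1" &&
    ./scripts/setup.sh
}
```

Run `bootstrap_deepops <release-tag>` from the directory where you want the checkout. The function leaves the shell inside the `deepops` directory, which is where the later steps are expected to run.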

Next, update the inventory file with the host details.

Add the hosts and the IP addresses to the inventory file.

Add kf-master to the [kube-master] and [etcd] groups, and add the remaining nodes to the [kube-node] group. This defines our cluster with one master node and four workers.
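As a rough illustration of that layout, the sketch below writes an example fragment to a scratch file rather than touching the real inventory. Only `kf-master` comes from this tutorial; the worker names, the IP addresses (taken from the 192.0.2.0/24 documentation range) and the scratch file name are placeholders. The real inventory file contains further groups that should be left in place.

```shell
# Write an illustrative inventory fragment to a scratch file.
# Worker names and IP addresses are placeholders; substitute your own.
INVENTORY_SKETCH="${INVENTORY_SKETCH:-inventory.sketch}"
cat > "$INVENTORY_SKETCH" <<'EOF'
[all]
kf-master    ansible_host=192.0.2.10
kf-worker-1  ansible_host=192.0.2.11
kf-worker-2  ansible_host=192.0.2.12
kf-worker-3  ansible_host=192.0.2.13
kf-worker-4  ansible_host=192.0.2.14

[kube-master]
kf-master

[etcd]
kf-master

[kube-node]
kf-worker-1
kf-worker-2
kf-worker-3
kf-worker-4
EOF
```

Copy the relevant lines into the inventory file by hand once the host names and addresses match your environment.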

Next, we need to disable NFS as the storage engine. Navigate to the deepops/config/group_vars directory and edit k8s-cluster.yml to disable the configuration of NFS server and client.

Set k8s_nfs_client_provisioner and k8s_deploy_nfs_server to false and save the file.
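If you prefer to make the change from the command line, a minimal sketch follows. It assumes GNU sed (as shipped with Ubuntu) and that both keys appear as uncommented top-level lines in the file; the function name `disable_nfs` is hypothetical.

```shell
# Hypothetical helper: set both NFS-related flags to false in a group_vars file.
# Assumes GNU sed and that each key starts an uncommented line.
disable_nfs() {
  # $1 = path to k8s-cluster.yml
  sed -i -E 's/^(k8s_nfs_client_provisioner|k8s_deploy_nfs_server):.*/\1: false/' "$1" || return 1
  # Print the resulting lines so the change can be checked by eye.
  grep -E '^(k8s_nfs_client_provisioner|k8s_deploy_nfs_server):' "$1"
}
```

From the root of the `deepops` checkout, this would be called as `disable_nfs config/group_vars/k8s-cluster.yml`. If the function prints nothing, the keys were not found in the expected form and the file should be edited by hand.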

Step 2: Install Kubernetes and Nvidia GPU Operator

We are now ready to kick off the installation. Run the DeepOps cluster playbook and wait for it to finish. This may take a few minutes.

This runs a customized Kubespray Ansible playbook to install the Kubernetes cluster, followed by the installation of the Nvidia GPU Operator on the GPU hosts. Wait until Ansible prints its final play recap with no failed tasks.
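A minimal sketch of that invocation, assuming the playbook path documented in the DeepOps README (`playbooks/k8s-cluster.yml`, limited to the `k8s-cluster` group); confirm it against the release you checked out. The wrapper name `deploy_cluster` is hypothetical.

```shell
# Hypothetical wrapper around the DeepOps cluster playbook.
# Run it from the root of the deepops checkout; extra arguments are
# passed through to ansible-playbook unchanged.
deploy_cluster() {
  ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml "$@"
}
```

Call it as `deploy_cluster`, or as `deploy_cluster -K` if Ansible needs to prompt for the sudo password on the hosts.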

Step 3: Verifying the Installation

Once the installation is done, copy the configuration file and kubectl binary to appropriate locations on the bootstrap machine to access the cluster.
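The copy step can be sketched as follows. The default artifacts directory (`config/artifacts` under the `deepops` checkout) and the file names `kubectl` and `admin.conf` are assumptions based on where Kubespray-style installs usually leave them; check the directory before running. The function name is hypothetical.

```shell
# Hypothetical helper: install the kubectl binary and kubeconfig that the
# playbook leaves behind. The default artifacts path is an assumption.
install_cluster_access() {
  # $1 = artifacts directory (defaults to config/artifacts)
  set -- "${1:-config/artifacts}"
  sudo cp "$1/kubectl" /usr/local/bin/kubectl &&
    mkdir -p "$HOME/.kube" &&
    cp "$1/admin.conf" "$HOME/.kube/config"
}
```

Note that this overwrites any existing `~/.kube/config` on the bootstrap machine, so back it up first if you already manage other clusters from there.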

We are now ready to access the Kubernetes cluster.

Let’s test if Kubernetes is able to access the GPU.

The GPU operator is successfully installed. We can verify it by checking the pods in the gpu-operator-resources namespace.
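The checks in this step can be collected into one helper. It does not launch a GPU workload; as a lighter-weight substitute it lists the nodes, prints how many `nvidia.com/gpu` resources each node advertises as allocatable, and lists the pods in the `gpu-operator-resources` namespace. The function name `verify_gpu_cluster` is hypothetical.

```shell
# Hypothetical helper: basic health checks for the cluster and GPU operator.
verify_gpu_cluster() {
  # All nodes should report Ready.
  kubectl get nodes -o wide || return 1
  # GPU nodes show a count in the GPUS column; CPU-only nodes show <none>.
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' || return 1
  # Operator components should be Running or Completed.
  kubectl get pods -n gpu-operator-resources
}
```

A GPU count on the GPU hosts indicates that the device plugin deployed by the operator has registered the GPUs with the scheduler.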

The next step is to install and configure the Portworx storage cluster as the default storage engine for Kubeflow.

Step 4: Install and Configure Portworx Storage Cluster

Portworx is a modern, distributed, cloud native storage platform designed to work with orchestrators such as Kubernetes. The platform, from the company of the same name, brings some of the proven techniques applied to traditional storage architecture to the cloud native environment.

For a detailed overview of Portworx architecture and the installation process, refer to this guide that I published at The New Stack.

Starting with version 2.7, Portworx uses the operator pattern to install and configure the storage cluster. I strongly recommend this approach for its simplicity and efficiency.

Choose the operator in the first step of the installation wizard available at Portworx Central.

In the next step, choose appropriate options for the on-premises, bare-metal cluster.

Accept the defaults in the next two steps, and run the commands as shown in the last step.

Once you run the kubectl commands as shown by the wizard, you will have the Portworx cluster up and running. Verify the installation by checking that the Portworx pods are running.
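One way to run that check is sketched below. It assumes the wizard's spec deployed Portworx into the `kube-system` namespace and that the Portworx pods carry the `name=portworx` label; pass a different namespace if your spec used one. The function name is hypothetical.

```shell
# Hypothetical helper: show the StorageCluster object and the Portworx pods.
verify_portworx() {
  # $1 = namespace Portworx was deployed into (kube-system is an assumption)
  set -- "${1:-kube-system}"
  # The operator-managed StorageCluster resource describes the cluster.
  kubectl get storagecluster -n "$1" || return 1
  # Expect one Portworx pod per storage node, all Running and Ready.
  kubectl get pods -n "$1" -l name=portworx -o wide
}
```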

For Kubeflow to function, we need a storage class that supports dynamic provisioning. Let’s go ahead and create a Portworx storage class optimized for running stateful workloads such as MySQL and MinIO, which are the core building blocks of Kubeflow.
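A storage class along those lines might look like the sketch below, written as a shell heredoc so it can be pasted on the bootstrap machine. The class name, the file name, and the parameter values (`repl`, `io_profile`, `priority_io`) are illustrative assumptions rather than the exact values used in this setup. The provisioner string is the in-tree Portworx provisioner.

```shell
# Write an illustrative Portworx StorageClass manifest to a file.
# Name and parameter values are assumptions; tune them for your cluster.
SC_FILE="${SC_FILE:-px-kubeflow-sc.yaml}"
cat > "$SC_FILE" <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-kubeflow-sc
  annotations:
    # Marks this class as the default for dynamic provisioning.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"            # keep three replicas of each volume
  io_profile: "db"     # I/O profile aimed at database workloads
  priority_io: "high"  # prefer the fastest available storage pool
EOF
```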

Notice that we annotated the class with storageclass.kubernetes.io/is-default-class: "true" to make it the default storage class used for dynamic provisioning.

Apply the storage class specification and verify the configuration.
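Applying and checking the class can be sketched with a small helper that takes the path to your storage class spec as its argument. The function name is hypothetical.

```shell
# Hypothetical helper: apply a StorageClass manifest and list the classes.
apply_storage_class() {
  # $1 = path to the manifest file
  kubectl apply -f "$1" || return 1
  # The default class is listed with "(default)" after its name.
  kubectl get storageclass
}
```

Only one class should carry the default marker; if another class is also marked as default, remove its annotation so that dynamic provisioning is unambiguous.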

With everything in place, let’s go ahead and install Kubeflow.

Step 5: Install and Verify Kubeflow

Start the Kubeflow installation by running the deployment script included with DeepOps.
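The location of that script has changed between DeepOps releases, so the default path in the sketch below (`./scripts/k8s/deploy_kubeflow.sh`) is an assumption; list the `scripts` directory of your checkout to confirm it. The wrapper name is hypothetical, and it refuses to run rather than guessing when the script is missing.

```shell
# Hypothetical wrapper: run the DeepOps Kubeflow deployment script if present.
deploy_kubeflow() {
  # $1 = path to the deploy script (the default path is an assumption)
  set -- "${1:-./scripts/k8s/deploy_kubeflow.sh}"
  if [ -x "$1" ]; then
    "$1"
  else
    echo "deploy script not found at $1; look under scripts/ for this release" >&2
    return 1
  fi
}
```

Run `deploy_kubeflow` from the root of the `deepops` checkout, or pass the script path explicitly if your release keeps it elsewhere.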

After the installation is done, make sure that all pods in the kubeflow namespace are running.

Kubeflow’s core building blocks such as MySQL and MinIO are now backed by Portworx volumes based on the storage class we created.
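Both observations can be checked from the bootstrap machine. The sketch below lists the pods and the persistent volume claims in the `kubeflow` namespace; the function name is hypothetical.

```shell
# Hypothetical helper: list Kubeflow pods and the volume claims behind them.
verify_kubeflow() {
  # Pods should settle into Running (or Completed for one-off jobs).
  kubectl get pods -n kubeflow || return 1
  # Each claim should be Bound, with the Portworx class in the STORAGECLASS column.
  kubectl get pvc -n kubeflow
}
```

If a claim stays in Pending, `kubectl describe pvc <name> -n kubeflow` usually shows whether the default storage class was picked up.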

You can access the Kubeflow UI by opening port 31380 on the master node in a browser.

Congratulations! You have successfully installed Kubeflow on a multinode, hybrid cluster with CPU and GPU hosts backed by the Portworx storage engine.

In the next part of this series, we will configure Notebook Servers to perform MLOps. Stay tuned!

Feature image via Pixabay.
