Tutorial: Configure Nvidia DeepOps to Use Portworx as Storage for Kubeflow

Nvidia DeepOps is a collection of scripts to configure Kubernetes and Kubeflow on CPU and GPU hosts. It comes with NFS as the default storage choice. This tutorial will demonstrate how to configure DeepOps to use Portworx by Pure Storage as the default storage engine for running the Kubeflow platform and the machine learning workloads.
The setup includes a hybrid collection of CPU and GPU hosts that will be part of the Kubernetes cluster. A key requirement is that all the hosts run the same Linux kernel, so it's a good idea to install Ubuntu 18.04 LTS Server on all the hosts before starting the installation.
Step 1: Customizing DeepOps Ansible Playbook
Since Nvidia DeepOps relies on Kubespray and Red Hat Ansible, you need a bootstrap machine to run the playbooks. This can be an Ubuntu VM that has access to the target host machine. I used an Ubuntu 18.04 VirtualBox VM running on my Mac as the bootstrap machine.
Make sure you can SSH into the hosts without a password, which means the SSH private key must be available on the bootstrap machine.
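If passwordless SSH is not yet configured, a minimal sketch looks like this (assuming the ubuntu user and the host IPs used in the inventory below):

# Generate a key pair on the bootstrap machine (skip if one already exists)
ssh-keygen -t rsa -b 4096
# Copy the public key to each host; repeat for every IP in the inventory
ssh-copy-id ubuntu@10.0.0.50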
Start by cloning the DeepOps GitHub repository on the bootstrap/provisioning machine.
git clone https://github.com/NVIDIA/deepops.git
Switch to a stable, tagged release of the installer.
cd deepops
git checkout tags/21.03
Install the prerequisites and configure Ansible.
./scripts/setup.sh
Next, update the inventory file with the host details.
vim config/inventory
Add the hosts and the IP addresses to the inventory file.
[all]
kf-master ansible_ssh_pass=ubuntu ansible_ssh_user=ubuntu ansible_host=10.0.0.50
kf-node-1 ansible_host=10.0.0.51
kf-node-2 ansible_host=10.0.0.52
kf-node-3 ansible_host=10.0.0.53
kf-node-4 ansible_host=10.0.0.54
Add kf-master to the [kube-master] and [etcd] groups, and add the remaining nodes to the [kube-node] section, as shown below. This defines our cluster with one master node and four workers.
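The relevant sections of the inventory file should look similar to this (host names taken from the [all] group above):

[kube-master]
kf-master

[etcd]
kf-master

[kube-node]
kf-node-1
kf-node-2
kf-node-3
kf-node-4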
Next, we need to disable NFS as the storage engine. Navigate to the deepops/config/group_vars directory and edit k8s-cluster.yml to disable the configuration of the NFS server and client.
vim k8s-cluster.yml
Set k8s_nfs_client_provisioner and k8s_deploy_nfs_server to false and save the file.
# NFS Client Provisioner
# Playbook: nfs-client-provisioner.yml
k8s_nfs_client_provisioner: false
k8s_deploy_nfs_server: false
k8s_nfs_mkdir: false # Set to false if an export dir is already configured with proper permissions
k8s_nfs_server: '{{ groups["kube-master"][0] }}'
k8s_nfs_export_path: '/export/deepops_nfs'
Step 2: Install Kubernetes and Nvidia GPU Operator
We are now ready to kick off the installation. Run the command below and wait for it to finish. This may take a few minutes.
ansible-playbook -u ubuntu -l k8s-cluster playbooks/k8s-cluster.yml
This runs a customized Kubespray Ansible playbook to install the Kubernetes cluster, followed by the installation of the Nvidia GPU Operator on the GPU hosts. Wait until you see output similar to this:
PLAY RECAP *********************************************************************
kf-master : ok=714 changed=166 unreachable=0 failed=0 skipped=1225 rescued=0 ignored=0
kf-node-1 : ok=412 changed=104 unreachable=0 failed=0 skipped=622 rescued=0 ignored=0
kf-node-2 : ok=412 changed=104 unreachable=0 failed=0 skipped=620 rescued=0 ignored=0
kf-node-3 : ok=412 changed=104 unreachable=0 failed=0 skipped=620 rescued=0 ignored=0
kf-node-4 : ok=412 changed=104 unreachable=0 failed=0 skipped=620 rescued=0 ignored=0
Step 3: Verifying the Installation
Once the installation is done, copy the configuration file and the kubectl binary to the appropriate locations on the bootstrap machine to access the cluster.
cp config/artifacts/kubectl /usr/local/bin/
mkdir ~/.kube
cp config/artifacts/admin.conf ~/.kube/config
We are now ready to access the Kubernetes cluster.
kubectl get nodes
Let’s test if Kubernetes is able to access the GPU.
export CLUSTER_VERIFY_EXPECTED_PODS=1
./scripts/k8s/verify_gpu.sh
The GPU operator is successfully installed. We can verify it by checking the pods in the gpu-operator-resources namespace.
kubectl get pods -n gpu-operator-resources
The next step is to install and configure the Portworx storage cluster as the default storage engine for Kubeflow.
Step 4: Install and Configure Portworx Storage Cluster
Portworx is a modern, distributed, cloud native storage platform designed to work with orchestrators such as Kubernetes. The platform, from the company of the same name, brings proven techniques from traditional storage architectures to the cloud native environment.
For a detailed overview of Portworx architecture and the installation process, refer to this guide that I published at The New Stack.
Starting with version 2.7, Portworx uses the operator pattern to install and configure the storage cluster. I strongly recommend this approach for its simplicity and efficiency.
Choose the operator in the first step of the installation wizard available at Portworx Central.
In the next step, choose appropriate options for the on-premises, bare-metal cluster.
Accept the defaults in the next two steps, and run the commands as shown in the last step.
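The wizard generates operator install commands similar to the sketch below. The exact URLs and query parameters are generated per deployment, so treat these as illustrative rather than copy-paste ready:

# Install the Portworx Operator (URL generated by the wizard; illustrative)
kubectl apply -f 'https://install.portworx.com/2.7?comp=pxoperator'
# Apply the generated StorageCluster spec (parameters are deployment-specific)
kubectl apply -f 'https://install.portworx.com/2.7?operator=true&...'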
Once you run the kubectl commands shown by the wizard, you will have the Portworx cluster up and running. Verify the installation with the below command.
kubectl get pods -n kube-system -l name=portworx
For Kubeflow to function, we need a storage class that supports dynamic provisioning. Let's go ahead and create a Portworx storage class optimized for running stateful workloads such as MySQL and MinIO, which are the core building blocks of Kubeflow.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
  io_profile: "db_remote"
Notice that we annotated the class with storageclass.kubernetes.io/is-default-class: "true" to make it the default storage class used for dynamic provisioning. The repl: "3" parameter tells Portworx to maintain three replicas of each volume, while io_profile: "db_remote" tunes the I/O path for database workloads.
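Once applied, this becomes the default class, so any claim that omits storageClassName will be provisioned by Portworx. As a quick sanity check, a minimal, hypothetical test claim would look like this:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: px-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi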
Apply the storage class specification and verify the configuration.
kubectl apply -f standard-sc.yaml
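To verify, list the storage classes and confirm that standard-sc is marked as the default:

kubectl get sc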
With everything in place, let’s go ahead and install Kubeflow.
Step 5: Install and Verify Kubeflow
Start the Kubeflow installation by running the following command:
./scripts/k8s/deploy_kubeflow.sh
After the installation is done, make sure that all pods in the kubeflow namespace are running.
kubectl get pods -n kubeflow
Kubeflow’s core building blocks such as MySQL and MinIO are now backed by Portworx volumes based on the storage class we created.
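You can confirm this by listing the persistent volume claims in the kubeflow namespace; each claim should show as Bound through the standard-sc storage class:

kubectl get pvc -n kubeflow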
You can access the Kubeflow UI by hitting port 31380 on the master node.
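For example, with the inventory above, the dashboard would be reachable at a URL like the following (assuming the default NodePort exposed by the DeepOps Kubeflow deployment):

http://10.0.0.50:31380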
Congratulations! You have successfully installed Kubeflow on a multinode, hybrid cluster with CPU and GPU hosts backed by the Portworx storage engine.
In the next part of this series, we will configure Notebook Servers to perform MLOps. Stay tuned!