Tutorial: Install Kubernetes and Kubeflow on a GPU Host with NVIDIA DeepOps

This post is the second in a series of articles exploring the Kubeflow machine learning platform. Check back each Friday for future installments. (Part 1)
In this post, we go through the installation process of Kubeflow, an open source machine learning platform that takes advantage of Kubernetes capabilities to deliver end-to-end workflow to data scientists, ML engineers, and DevOps professionals. The testbed configured in this tutorial will be used for exploring the building blocks of the platform covered in the future installments of this tutorial series.
Kubeflow can be installed on any Kubernetes cluster that has a minimum of 4 CPUs, 50 GB storage, and 12 GB RAM. It can be installed on managed Kubernetes services such as Amazon Web Services' Elastic Kubernetes Service (EKS), Azure Kubernetes Service, Google Kubernetes Engine, and IBM Kubernetes Service. Kubeflow can also be installed in on-prem environments running Kubernetes on bare metal hosts. Refer to the Kubeflow documentation for details on the installation.
For this tutorial, we will use the DeepOps installer from NVIDIA, which simplifies the installation process. In about 20 minutes, we will have a fully configured Kubeflow environment available to us. NVIDIA created DeepOps primarily for installing Kubernetes on a set of hosts with GPUs, but it can also target non-GPU hosts.
I recently built a custom machine for experimenting with AI. Based on an AMD Ryzen Threadripper 3990X CPU with 64 cores, an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory and 10,496 CUDA cores, 128 GB of RAM, and 3 TB of NVMe storage, it is a powerhouse. This is a perfect candidate for running a single-node Kubernetes cluster backed by NVIDIA drivers and the CUDA Toolkit for GPU access.
I found DeepOps as the perfect tool to install the combination of Kubernetes and Kubeflow on this machine to configure a single-node Kubernetes with GPU access. DeepOps also installs other optional components such as dynamic NFS provisioner, Ceph/Rook, and Prometheus with Grafana.
If you don’t have a GPU machine, you can choose to install Kubernetes first followed by Kubeflow. In my experience, NVIDIA DeepOps works well even in non-GPU environments.
Preparing the Host GPU Machine for DeepOps
If you are relying on NVIDIA DeepOps to install Kubernetes and Kubeflow, you don’t need to install anything other than the supported OS and NVIDIA drivers. For the testbed, I installed Ubuntu 20.04 and configured the NVIDIA 460 driver. Make sure you replace the default open source Nouveau graphics drivers with the proprietary NVIDIA drivers.
Make sure the drivers are properly installed by running the nvidia-smi command.
Add a user with an SSH key and passwordless sudo access. I created a user, ubuntu, that I plan to use with the DeepOps installer.
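As a minimal sketch, creating such a user on Ubuntu might look like the commands below (run as root on the GPU host; the user name ubuntu matches the one used later with Ansible, and the public key path is a placeholder):

```shell
# Sketch: create the "ubuntu" user with passwordless sudo.
# Assumes Ubuntu and root privileges on the GPU host.
id ubuntu 2>/dev/null || adduser --disabled-password --gecos "" ubuntu

# Allow the user to run sudo without a password (what Ansible expects)
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu
chmod 0440 /etc/sudoers.d/ubuntu

# Authorize your bootstrap machine's public key (path is a placeholder)
mkdir -p /home/ubuntu/.ssh
chmod 700 /home/ubuntu/.ssh
# cat your_key.pub >> /home/ubuntu/.ssh/authorized_keys
chown -R ubuntu:ubuntu /home/ubuntu/.ssh
```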
Preparing the Bootstrap Machine
Since NVIDIA DeepOps relies on Kubespray and Ansible, you need a bootstrap machine to run the playbooks. This can be an Ubuntu VM that has access to the target host machine. I used an Ubuntu 18.04 VirtualBox VM running on my Mac as the bootstrap machine.
Make sure you can SSH into the GPU host without a password, which means the SSH private key must be available on the bootstrap machine.
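If you have not set up key-based SSH yet, a minimal sketch is shown below (the key path is a throwaway example, and the ssh-copy-id step is commented out because it requires the live GPU host from this tutorial):

```shell
# Sketch: generate a key pair on the bootstrap machine
# (throwaway path so an existing key is not overwritten)
rm -f /tmp/deepops_key /tmp/deepops_key.pub
ssh-keygen -t ed25519 -f /tmp/deepops_key -N "" -q

# Copy the public key to the GPU host, then verify password-less login
# (commented out: requires the GPU host at 172.16.0.30 to be reachable)
# ssh-copy-id -i /tmp/deepops_key.pub ubuntu@172.16.0.30
# ssh -i /tmp/deepops_key ubuntu@172.16.0.30 hostname
```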
Start by cloning the DeepOps GitHub repository on the bootstrap/provisioning machine.
git clone https://github.com/NVIDIA/deepops.git
Switch to the most stable version of the installer.
cd deepops
git checkout tags/20.12
Install the prerequisites and configure Ansible.
./scripts/setup.sh
Next, update the inventory file with the GPU host details.
vim config/inventory
Under [all], add the hostname and the IP address. I am calling my host ai-testbed with the IP address 172.16.0.30. Add the same host under the [kube-master], [etcd], and [kube-node] sections.
If you have a multinode cluster, you can split them as control plane and worker nodes in the inventory files. Since I am building a single-node GPU cluster, I have the same host playing all the roles of the cluster.
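As an illustration, a single-node inventory might look like the sketch below. The hostname and IP are from my testbed, and the exact section names follow the Kubespray-style inventory that ships with DeepOps; check the sample file in config/ for the authoritative layout.

```ini
; Sketch of config/inventory for a single-node cluster
[all]
ai-testbed ansible_host=172.16.0.30

[kube-master]
ai-testbed

[etcd]
ai-testbed

[kube-node]
ai-testbed

[k8s-cluster:children]
kube-master
kube-node
```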
Note that Kubespray will rename the host(s) based on the inventory file. It’s not recommended to change the hostname after installing Kubernetes.
Now, we are ready to kick off the installation.
Installing Kubernetes with DeepOps
Now that the host and the bootstrap machines are ready, let’s run the installation.
It starts with a single command shown below. I had better results when I used the CUDA repo to install the NVIDIA CUDA runtime and cuDNN libraries. The other option is to install the runtime through the GPU operator.
Run the command to ensure that the installer uses the CUDA repo to configure the runtime.
ansible-playbook -u ubuntu -l k8s-cluster -e '{"nvidia_driver_ubuntu_install_from_cuda_repo": yes}' playbooks/k8s-cluster.yml
This will start the Ansible playbook to install Kubernetes. If you are familiar with Kubespray, you will find that DeepOps is based on the same installer.
Wait for the installer to finish the installation. It may take anywhere from 10 to 20 minutes depending on your Internet connection.
Copy the configuration file and the kubectl binary to access the cluster.
cp config/artifacts/kubectl /usr/local/bin/
mkdir ~/.kube
cp config/artifacts/admin.conf ~/.kube/config
We are now ready to access the single-node Kubernetes cluster.
kubectl get nodes
Let’s test if Kubernetes is able to access the GPU.
export CLUSTER_VERIFY_EXPECTED_PODS=1
./scripts/k8s/verify_gpu.sh
You can also run nvidia-smi in a one-off pod to confirm GPU access from a workload.

kubectl run gpu-test --rm -t -i --restart=Never --image=nvcr.io/nvidia/cuda:10.1-base-ubuntu18.04 --limits=nvidia.com/gpu=1 -- nvidia-smi
Configuring the Kubernetes Cluster
Let’s go ahead and install NFS provisioner, Prometheus, and Grafana.
The command below configures NFS as the backend for dynamic provisioning of persistent volume claims (PVCs). It is also useful for sharing volumes between Jupyter Notebooks.
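To illustrate, a PVC bound to the NFS-backed class might look like the sketch below. The class name nfs-client is what the NFS client provisioner typically registers; confirm the actual name on your cluster with kubectl get sc before applying.

```yaml
# Sketch: a PVC using the NFS-backed storage class
# (class name "nfs-client" is an assumption; verify with `kubectl get sc`)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-data
spec:
  accessModes:
    - ReadWriteMany        # NFS allows the volume to be shared across pods
  storageClassName: nfs-client
  resources:
    requests:
      storage: 10Gi
```

Apply it with kubectl apply -f pvc.yaml and the provisioner will create the backing NFS directory on demand.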
ansible-playbook playbooks/k8s-cluster/nfs-client-provisioner.yml
This step also configures a default storage class, which is essential for the Kubeflow installation. Verify it with the command below.
kubectl get sc
To install Prometheus and Grafana, run the following command:
./scripts/k8s/deploy_monitoring.sh
You can access the Grafana dashboard at http://gpu_host:30200 with the username admin and the password deepops.
Here is the dashboard showing the GPU node statistics.
Installing Kubeflow
Finally, we are ready to install Kubeflow. Run the command below and wait a few minutes for the UI to become available.
./scripts/k8s/deploy_kubeflow.sh
After the installation is done, make sure that all pods in the kubeflow namespace are running.
kubectl get pods -n kubeflow
You can access the Kubeflow dashboard at http://gpu_host:31380.
Congratulations! You have successfully built a Kubernetes cluster running Kubeflow.
In the next part, we will explore Kubeflow components. Stay tuned.
Amazon Web Services is a sponsor of The New Stack.