Kubernetes / Machine Learning

Build a Machine Learning Testbed Based on Kubernetes and Nvidia GPU

27 Jul 2018 6:00am

Nvidia GPUs have become the de facto standard for running machine learning jobs. From entry-level graphics cards to the Pascal-based Tesla P100 GPUs in the cloud, data scientists rely on Nvidia hardware for training machine learning models and running inference.

On the infrastructure front, Kubernetes has become the standard for running modern applications. It has evolved from running only stateless workloads to hosting transactional databases.

Nvidia has been slowly but steadily adding support for containers and Kubernetes. Today, it is possible to access GPUs from containers and Kubernetes pods. Almost all the Containers-as-a-Service (CaaS) providers expose Nvidia K80 and P100 GPUs through Kubernetes.

Even though we can access GPUs in the public cloud, nothing beats building our own GPU-based development machine running Kubernetes. Depending on your budget, you can choose anything from an entry-level GTX 1050 Ti to a high-end TITAN X GPU to power your testbed.

I recently built a custom machine based on the humble GeForce GTX 1050 Ti GPU. Being a fan of Kubernetes, I wanted to run a single-node cluster for my machine learning experiments. By no means does this match the horsepower delivered by the K80s and P100s available in the public cloud, but it is sufficient to explore GPU-based deep learning frameworks such as TensorFlow and Caffe.

In this tutorial, I am going to walk you through the steps involved in building a GPU-backed, single-node Kubernetes cluster.

Kubernetes on Nvidia GPUs is available in preview. Please note that this configuration is not recommended for production environments.

Prerequisites

You will need to install Ubuntu 16.04 with the latest Nvidia driver for your GPU. You will then have to install and configure the latest CUDA and cuDNN software. There are many guides available on the web to help you with this step.

Make sure that the command nvidia-smi works without any errors. It should print a table listing the driver version and each detected GPU, along with its memory usage and utilization.

Once the GPU software is installed and configured, set up Nvidia-Docker. For step-by-step instructions, please follow the guide that I posted a few weeks ago at The New Stack.

When you run the nvidia-docker command with no arguments, it should print its usage information, confirming that the wrapper is installed correctly.
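A quick smoke test can confirm that containers see the GPU; the CUDA image tag below is just an example, so substitute whichever tag matches your installed CUDA version:

```shell
# Run nvidia-smi inside a CUDA container to confirm GPU access from Docker
# (image tag is illustrative; pick one matching your host's CUDA version)
nvidia-docker run --rm nvidia/cuda:9.0-base nvidia-smi
```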

Now that we have the prerequisites in place, let’s go ahead and install Kubernetes.

Installing Kubernetes on Nvidia GPUs

The process of installing and configuring Kubernetes on GPUs is not very different from the regular setup. Nvidia has built GPU-specific container images for Kubernetes, which will be used instead of the standard images.

We will use Nvidia's build of kubeadm, the easiest tool for bootstrapping a Kubernetes cluster on GPU machines.

Start by adding Nvidia's GPG key and package repository to your machine, and then updating the package index.
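The commands below sketch this step. The repository URLs reflect Nvidia's preview packaging at the time of writing and are assumptions on my part; check Nvidia's current documentation if they have moved:

```shell
# Add Nvidia's GPG key for its Kubernetes preview packages
# (URLs are from the preview-era docs and may have changed)
curl -s -L https://nvidia.github.io/kubernetes/gpgkey | sudo apt-key add -

# Register the package repository, then refresh the index
curl -s -L https://nvidia.github.io/kubernetes/ubuntu16.04/nvidia-kubernetes.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-kubernetes.list
sudo apt-get update
</imports>
```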

The next step will install specific versions of the Kubernetes components provided by Nvidia.
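A sketch of the install, assuming Nvidia's builds carry a +nvidia version suffix as they did in the preview; the exact version strings are illustrative, so pin whatever the repository actually lists:

```shell
# Install Nvidia's parallel builds of the Kubernetes components
# (version strings are examples; run `apt-cache madison kubeadm` to see
#  what the repository really offers)
sudo apt-get install -y \
  kubectl=1.9.7+nvidia \
  kubelet=1.9.7+nvidia \
  kubeadm=1.9.7+nvidia
```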

Notice how Nvidia has built parallel implementations of the kubelet and kubeadm.

Before we proceed with the rest of the configuration, we need to disable kubeadm's preflight check for swap. Since we are building a development machine with limited resources, turning off swap entirely is not a good idea. Instead, we will add a flag to the kubelet configuration file telling it to tolerate swap.

Edit the configuration file below and add the parameter to the KUBELET_EXTRA_ARGS variable.
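Assuming kubeadm's default drop-in location for the kubelet unit (the path may differ on your system), the change looks like this:

```shell
# Tell the kubelet to start even when swap is enabled, by appending an
# Environment line to kubeadm's systemd drop-in (path assumes the default
# kubeadm layout on Ubuntu 16.04)
echo 'Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"' | \
  sudo tee -a /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
```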

Let’s restart the kubelet to apply the changes.
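With systemd, that is:

```shell
# Reload unit files so the new drop-in is picked up, then restart the kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```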

You can check the status of kubelet by using the following command. Don’t worry if the initialization of the kubelet fails. This error is due to a missing CA certificate, which is generated as part of initializing the cluster using kubeadm in the next step.
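For example:

```shell
# Show the kubelet's current state; a failure mentioning a missing CA
# certificate is expected until kubeadm init runs in the next step
sudo systemctl status kubelet

# Or inspect the recent log lines directly
sudo journalctl -u kubelet --no-pager | tail
```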

Now, we are ready to initialize the master. Run the below command and wait for a few minutes for the master to start. This step will do the heavy lifting to initialize the master.
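A minimal invocation might look like the following; the pod CIDR matches Flannel's default (we install Flannel later), and the preflight override lets initialization proceed with swap still enabled:

```shell
# Bootstrap the control plane; 10.244.0.0/16 is Flannel's default pod CIDR
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --ignore-preflight-errors=Swap
```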

If everything goes well, you should see the below output.

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

You should now deploy a pod network to the cluster.

Run “kubectl apply -f [podnetwork].yaml” with one of the options listed on this documentation page:

You can now join any number of machines by running the following on each node as root:

This is a confirmation that the master is now ready. Since we are running a single node, we don’t have additional nodes to be added.

Copy the config file to the default location under the .kube folder.
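These are the same commands kubeadm prints at the end of initialization:

```shell
# Copy the admin kubeconfig to the default location kubectl reads from
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```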

Because we are setting up a single-node cluster, the master node must run workloads as well, so we need to tell Kubernetes to allow scheduling of pods on the master. Run the following command to configure this option.
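The command removes the NoSchedule taint that kubeadm places on the master:

```shell
# Untaint the master so regular pods can be scheduled on it
kubectl taint nodes --all node-role.kubernetes.io/master-
```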

We have a couple more steps to complete. Let’s install Flannel as the overlay network, which is essential for the pods to talk to each other.
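Flannel ships as a single manifest; the URL below was the standard location at the time of writing:

```shell
# Deploy the Flannel overlay network across the cluster
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
```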

We are now ready to go ahead and test the cluster.

Notice that the device plugin required to access the underlying GPU is deployed as a Kubernetes daemonset, which is visible through the following command.
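For example (look for the Nvidia device plugin entry in the listing):

```shell
# List daemonsets in the system namespace; the GPU device plugin
# deployed by Nvidia's kubeadm packages should appear here
kubectl get daemonset -n kube-system
```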

Let’s do the final check to see if we can access the GPU from a Kubernetes pod. We will run the standard ubuntu:16.04 Docker image with a couple of additional parameters.
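One way to run this check, assuming the device plugin exposes the nvidia.com/gpu resource (the pod name here is arbitrary):

```shell
# Launch a throwaway pod that requests one GPU and runs nvidia-smi.
# This works because Nvidia's container runtime mounts the driver
# utilities into the container at startup.
kubectl run gpu-test --rm -it --restart=Never \
  --image=ubuntu:16.04 \
  --limits=nvidia.com/gpu=1 \
  -- nvidia-smi
```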

You can easily extend this scenario to run Nvidia DIGITS on Kubernetes to train advanced neural networks.

This walkthrough covered the basics of configuring Kubernetes on Nvidia GPUs. In future articles, I will cover how to train machine learning models at scale with containers and GPUs.

Feature image via Pixabay.
