The rise of machine learning and artificial intelligence has put Nvidia on a roll. With GPUs becoming more important than ever, the chip maker is firing on all cylinders. Academic institutions, large cloud providers and enterprises all rely on Nvidia’s GPUs to run ML and HPC workloads.
Despite the popularity of and demand for GPUs, installing, configuring, and integrating an end-to-end Nvidia GPU stack is not easy. It all starts with installing the Nvidia driver along with CUDA and cuDNN.
CUDA acts as an intermediary for programming the GPUs. For deep learning jobs, developers also need cuDNN, a library of deep neural network primitives built on top of CUDA, to delegate the mathematical computation of neural networks to the GPU. The installation and configuration experience is not very smooth. A minor version difference in any of these layers can break the configuration. Add the version incompatibilities and dependencies of deep learning frameworks to the mix, and it becomes a mess.
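The version-matching problem described above can be sketched as a small check. This is only an illustration: the framework names and the requirement table are hypothetical, and the installed versions are example values. Real requirements come from each framework's release notes.

```python
from typing import Dict, List

# Hypothetical compatibility table, for illustration only.
REQUIRED: Dict[str, Dict[str, str]] = {
    "framework-a 1.0": {"cuda": "9.0", "cudnn": "7.0"},
    "framework-b 2.0": {"cuda": "9.2", "cudnn": "7.1"},
}


def major_minor(version: str) -> str:
    """Reduce a version such as '9.0.176' to its 'major.minor' prefix."""
    return ".".join(version.split(".")[:2])


def find_mismatches(framework: str, installed: Dict[str, str]) -> List[str]:
    """Return one message per layer whose major.minor differs from the requirement."""
    problems = []
    for layer, wanted in REQUIRED[framework].items():
        found = installed.get(layer)
        if found is None:
            problems.append(f"{layer}: not installed, need {wanted}")
        elif major_minor(found) != wanted:
            problems.append(f"{layer}: found {found}, need {wanted}")
    return problems


if __name__ == "__main__":
    # Example host: CUDA is one minor version ahead of what the framework expects.
    host = {"cuda": "9.2.148", "cudnn": "7.0.5"}
    for line in find_mismatches("framework-a 1.0", host):
        print(line)
```

Even in this toy form, a single minor-version drift in one layer is enough to flag the whole stack as unusable for a given framework, which is the pain point containers are meant to remove.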
To ease this process, Nvidia has turned to containers. It has integrated a runtime with Docker that’s specific to GPUs. Nvidia-Docker exposes underlying GPU infrastructure to containers. With the solid foundation laid right at the container runtime, Nvidia has expanded its platform to Kubernetes. CUDA and cuDNN can be accessed from Kubernetes Pods to run training and inferencing at scale. Finally, Nvidia has also built its own container registry that contains official images for mainstream deep learning frameworks. Developers can extend these container images to build their own images.
Let’s take a closer look at these investments from Nvidia:
I discussed the architecture and installation of Nvidia-Docker before. As mentioned earlier, this is the foundational aspect of Nvidia’s investments in containers. Recently, Nvidia has updated the runtime with support for the latest version of Docker. The Nvidia-Docker runtime can be easily installed on any Linux machine equipped with a GPU and a Docker engine.
Once configured, containerized workloads gain access to the underlying GPUs. Below is a screenshot of a container accessing an Nvidia Quadro P4000 GPU.
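A quick way to confirm that a container can see the GPU is to run nvidia-smi inside it. The sketch below only builds and prints the command. It assumes the Nvidia-Docker runtime is registered with Docker under the name `nvidia`, and that an image named `nvidia/cuda` is available; both are assumptions that may differ on a given setup.

```python
import shlex
from typing import List


def gpu_check_command(image: str = "nvidia/cuda", runtime: str = "nvidia") -> List[str]:
    """Build a `docker run` invocation that runs nvidia-smi inside a container."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, "nvidia-smi"]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs on
    # machines without Docker or a GPU.
    print(shlex.join(gpu_check_command()))
```

To actually run the check, the returned list can be passed to `subprocess.run`. If the runtime is wired up correctly, nvidia-smi reports the host GPU from inside the container.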
Nvidia and Kubernetes
The ability to run containers on GPUs is a great first step. The real value of this integration is realized through scalable ML workloads running on Kubernetes.
At the recent Computer Vision and Pattern Recognition (CVPR) conference, Nvidia released new software, Kubernetes on Nvidia GPUs Release Candidate. According to the company, Kubernetes on Nvidia GPUs lets developers and DevOps engineers build and deploy GPU-accelerated deep learning training or inference applications on multi-cloud GPU clusters, at scale. It enables the automation of deployment, maintenance, scheduling and operation of GPU-accelerated application containers.
There are two different flavors of Kubernetes on Nvidia GPUs: one for cloud service providers, and the other for servers and desktops.
I got a chance to set up a Kubernetes cluster on an Ubuntu 16.04 machine powered by a humble Nvidia GeForce GTX 1050Ti. The initial impression is pretty good and encouraging. Excluding the time it took to install the drivers and toolkits, I had a single-node cluster up and running in less than 10 minutes.
The overall process of installing and configuring a Kubernetes cluster is not very different from a standard setup. Nvidia has modified the kubeadm, kubelet, and kubectl binaries to support GPUs. Obviously, each node participating in the cluster should have a GPU attached. I will share my experience and the gotchas of setting this up in a separate post.
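Once a GPU-enabled cluster is running, a workload asks for GPUs in its Pod spec. The sketch below generates such a manifest as JSON. It assumes the GPU is exposed under the resource name `nvidia.com/gpu`, which is what Nvidia's device plugin uses on upstream Kubernetes; whether this release candidate uses the same name is an assumption. The Pod name and image are placeholders.

```python
import json
from typing import Any, Dict


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> Dict[str, Any]:
    """Return a Pod spec whose single container asks for `gpus` Nvidia GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["nvidia-smi"],
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_pod_manifest("gpu-check", "nvidia/cuda"), indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML. The scheduler should then place the Pod only on a node with a free GPU.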
Below is a screenshot of Kubernetes on Nvidia GPUs in action.
Nvidia GPU Cloud
When I first heard about the Nvidia GPU Cloud, also announced last week, I thought Nvidia was getting into the game by launching its own public cloud. But with the GPU Cloud, Nvidia actually meant a curated and well-maintained Docker registry of deep learning framework images.
According to Nvidia, the GPU Cloud is a catalog of fully integrated and optimized deep learning software containers that can run on Nvidia GPUs. These containers are delivered ready-to-run, including all necessary dependencies such as Nvidia CUDA Toolkit, Nvidia deep learning libraries, and an operating system.
The registry has up-to-date images for popular deep learning frameworks such as Caffe, TensorFlow, CNTK, and MXNet. Developers and data scientists can sign up at Nvidia to get an API key to pull container images.
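Pulling an image with the API key might look like the sketch below, which only prints the commands. It assumes the registry is served at `nvcr.io`, that framework images live under an `nvidia/` namespace, and that the login user name is the literal string `$oauthtoken` with the API key as the password. The `<tag>` value is a placeholder left for the reader to fill in from the registry listing.

```python
import shlex
from typing import List

REGISTRY = "nvcr.io"  # assumed host name of the Nvidia GPU Cloud registry


def image_ref(framework: str, tag: str) -> str:
    """Compose a registry path of the form nvcr.io/nvidia/<framework>:<tag>."""
    return f"{REGISTRY}/nvidia/{framework}:{tag}"


def pull_commands(framework: str, tag: str) -> List[List[str]]:
    """Return the login and pull commands as argument lists."""
    return [
        # The user name is the literal string "$oauthtoken"; the API key is
        # typed at the password prompt so it never appears in the command.
        ["docker", "login", "--username", "$oauthtoken", REGISTRY],
        ["docker", "pull", image_ref(framework, tag)],
    ]


if __name__ == "__main__":
    for cmd in pull_commands("tensorflow", "<tag>"):
        print(shlex.join(cmd))
```

Keeping the key out of the argument list is a deliberate choice here: anything passed on the command line can end up in shell history or process listings.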
Major cloud platforms including AWS, Azure, and GCP have pre-configured deep learning images that can be launched as GPU instances. It only takes a few minutes for the instance to become available. Once Nvidia-Docker is installed within the VM, users can pull appropriate container images from Nvidia GPU Cloud. Custom images that extend Nvidia images can be pushed into the GPU Cloud.
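Extending one of the Nvidia images usually comes down to a short Dockerfile that starts from it. The sketch below renders one as text. The base image path, the `<tag>` placeholder, and the package list are illustrative assumptions, and it presumes that `pip` is available inside the base image.

```python
from typing import Iterable


def render_dockerfile(
    base_image: str,
    pip_packages: Iterable[str],
    workdir: str = "/workspace",
) -> str:
    """Render a Dockerfile that layers extra Python packages on a base image."""
    lines = [f"FROM {base_image}"]
    packages = " ".join(pip_packages)
    if packages:
        lines.append(f"RUN pip install --no-cache-dir {packages}")
    lines.append(f"WORKDIR {workdir}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The tag and the package list are placeholders, not recommendations.
    print(render_dockerfile("nvcr.io/nvidia/tensorflow:<tag>", ["pandas"]), end="")
```

Because the CUDA and cuDNN layers come from the base image, the custom image only has to describe what is added on top.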
From container runtime to registry to orchestration engine, Nvidia is making the right moves to make GPU software accessible to developers.
Feature image via Nvidia.