Microsoft sponsored this article, which was written by The New Stack.
Containers and the Kubernetes orchestrator are emerging as core technologies for delivering distributed training at the scale that effective machine learning requires to be reliable in production use cases.
Massive labeled datasets and the tools to work with them have driven the rise of machine learning from obscure research technique to mainstream technology. Years of gradual improvements in algorithms have been bundled into frameworks like TensorFlow and PyTorch, and using the massive parallelism of GPUs speeds up the process of training those algorithms. Containers have their own tradeoffs but with the right tools and cloud scale, they’re showing real promise in scale-out architectures to support distributed machine learning environments.
“Containers are a great way to package and distribute machine learning models and Kubernetes is a great way to orchestrate and scale the training and serving of those models,” points out Lachlan Evenson from Microsoft’s Azure Containers team. “Kubernetes makes commodity NVidia GPUs a schedulable resource; you can say ‘I want this many of this type of that GPU’ and have them scheduled in Kubernetes.”
The Kubernetes primitives may not be particularly intuitive for data scientists to work with but open source tools like Kubeflow make it easier for them to take advantage of Kubernetes to train, scale and serve data models without needing to become infrastructure experts.
But scaling up Kubernetes to massive cluster sizes with GPU nodes can be prohibitively expensive on your own infrastructure. Even with Kubeflow to simplify the job of the data scientists, making Kubernetes efficient at scale is still complex enough to demand infrastructure expertise somewhere in your organization. Unless you have that deep expertise or are acquiring it to run other microservices workloads, cloud container services with GPU nodes like Azure Kubernetes Service are the logical next step.
Scaling Kubernetes Takes Experts
OpenAI chose Kubernetes for deep learning research because it allows for fast iteration cycles and it scales to the size of the organization’s largest models — their largest cluster is over 2,500 nodes. But even a research foundation funded by leading Silicon Valley luminaries with its own data centers doesn’t have the resources to scale its infrastructure to support such large models. Instead, the cluster is on Azure where the resources are available to run the clusters efficiently.
Getting Kubernetes to that scale, even on Azure, required OpenAI to develop deep expertise in managing the orchestrator. When their first cluster passed 500 nodes OpenAI had to switch the etcd cluster storing state for the Kube masters to using directly connected SSDs to avoid latency problems; once it passed 1,000 nodes they had to reconfigure their logging and monitoring, increase the storage on the etcd cluster and move the Docker root to SSD, finetune KubeDNS, manage the Docker image pull pipeline aggressively and reconfigure networking.
With large data sets, training takes a long time, leaving data scientists sitting and waiting. “They couldn’t get enough infrastructure in their own data centers to keep all their data scientists working efficiently, so they scaled their Kubernetes cluster out to the cloud,” Evenson explains. But just shifting to the cloud doesn’t automatically give you efficient distributed training. Doubling the resource available doesn’t result in linear scaling without some extra work, because many machine learning models weren’t designed for distributed training. You may need to consider less familiar tools like Horovod, a framework developed at Uber that plugs into TensorFlow, PyTorch or Keras. Horovod improves model scalability on a distributed system like Kubernetes with fewer changes to the model than using a parameter server.
As machine learning models go through each batch of steps in their training, the parameters of the model get adjusted; with multiple workers in distributed training, a parameter server collects and distributes the updated parameters back to all the workers but batching and distributing those changes is very network and CPU intensive and stops performance scaling as you add more GPUs and workers. Horovod uses NVidia’s NCCL protocol to communicate directly between GPUs without needing a parameter server, which means workers can distribute changes to the model over time without that becoming a bottleneck.
Distributed training requires a lot of network bandwidth in general; improving container overlay network communications using RDMA and the FreeFlow CNI plugin instead of leaving it up to the OS kernel to batch up and transfer data lowers latency and improves network throughput without increasing CPU usage, which also helps improve scalability.
Scaling Distributed Learning on Azure Cost-Effectively
Kubeflow, Horovod and Freeflow all work with Azure Kubernetes Service using Nvidia VMs and can significantly improve scaling for common deep learning models like ResNet, VGG-16 and Inception v3. “The goal is to make it easier for data scientists to get on an open source platform and to develop their machine learning models, so you want very lightweight, scalable infrastructure under the hood, you want models to scale and infrastructure to scale linearly so they don’t have to pay a lot more to get more work done.”
But using a VM-based Kubernetes service like AKS doesn’t address all the issues with machine learning training at scale in the cloud. “When I kick off a training job I need a lot of resources and when I’m done I don’t need any of that,” Evenson explains. “Even though autoscaling in Kubernetes is great, you’re still waiting for a VM to spin up and spin down and the cost associated with that. In the VM model, you either overprovision and have expensive GPU clusters sitting around idle or take the hit of scaling so that when you drop a job on the cluster it can take several minutes for every individual VM to be up and ready to service the load.”
Azure Container Instances avoids that delay by delivering on-demand containers, and it will soon begin to add GPU support. “ACI will enable workloads to be spun up very quickly; because you’re not bound by VM spin up and spin down time, you can leverage a cluster to spin up a lot of nodes in a shorter time and speed up delivery.” As well as having the model ready to train more quickly, that gives you even more scalability, Evenson points out.
“Because GPU nodes are very costly, people only want to have them when they’re processing data, and they want to know how efficiently they’re running. Am I filling the GPUs or are they sitting at half capacity so I could run half the fleet that I have? ACI can scale really quickly to deliver the batch workload and then scale back down once the workload is complete.” ACI also offers per second billing rather than per minute VM billing. And because you plug into ACI using Virtual Kubelets that behave like standard Kubernetes kubelets, no code changes are needed Evenson says, “but the back end is running in nodeless Kubernetes rather than being bound to a VM.”
Machine learning frameworks have democratized the tools for building deep learning models, and an ecosystem of tools is turning Kubernetes into an effective way to operationalize that without needing data scientists to become infrastructure experts. Kubernetes services in the cloud are making the scale needed to work efficiently with the largest datasets available to everyone.
Feature image: Horovod scales much better to distributed learning models than traditional distributed TensorFlow (benchmarked on AKS). Image provided by Microsoft.