Kubernetes: Distributed Deep Learning on Heterogeneous GPU Clusters

Deep learning is a class of machine learning algorithms that learn multiple levels of representation of data, potentially decreasing the difficulty of producing models that can solve hard problems. For the hardest problems, this decrease manifests as an advance in what can be done at all. For simpler problems, it may result in either better performance or a simpler modeling process. A number of deep learning applications now surpass human performance, such as image recognition or playing chess or Go. Other prominent use cases where the most capable systems rely on deep learning include voice recognition, text recognition, machine translation and autonomous driving.
Much of this performance has been due to recent advances in the mathematics of machine learning, but at least a portion is due to advances in raw computing performance and improvements in the techniques used to exploit that performance.
Many organizations have begun to explore methods by which they can train and serve deep learning models on a cluster in a distributed fashion. Many choose to build a dedicated GPU HPC cluster, which works well in a research or development setting, but a consistent problem is that data has to be moved back and forth between clusters as training data is prepared, used for training and then used for serving. Often there are multiple model-training clusters, which simply makes the data-motion problem worse. Such data motion alone is a substantial cost, but storing data in multiple locations also leads to high complexity in managing the data used to train deep learning models and in managing the models between research/development and production.
In this article, we will outline some of the techniques we use to make it easier to train and serve distributed deep learning models in an environment consisting of heterogeneous GPU and CPU resources. A key result is that the distributed deep learning training workflow can be made much simpler. We also describe how to leverage a streaming architecture to enable global, real-time deep learning applications on GPUs.
Distributed Training of Deep Learning Models
We have found that the following capabilities are both necessary and sufficient for building a vastly simpler machine learning training and serving pipeline:
- All data, whether stored in streams, tables or files, can be organized and accessed using a traditional directory-based tree of path names.
- All data is stored in a completely distributed fashion, but can be accessed from any authorized computer using the same path names.
- All data is accessible via standard POSIX file operations.
- Volumes are available that allow directory structures to be managed in terms of where the contents of those directories are stored.
- The performance of all data access is at or very near the hardware limits. The germane limits would be network bandwidth for non-local data and local disk speeds for data that is local to a computational process.
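To make the path-name and POSIX requirements concrete, here is a minimal sketch (not part of the original list) of a TensorFlow input pipeline that reads training files through ordinary path names. The mount point /mnt/distributed-fs and the file layout are assumptions for illustration; the point is that the same paths resolve on any authorized node.

```python
import glob
import tensorflow as tf

# Assumed mount point of the distributed file system; the same path resolves on every node.
DATA_DIR = "/mnt/distributed-fs/projects/imagenet/tfrecords"

# Ordinary POSIX globbing works because the volume is mounted like a local file system.
filenames = sorted(glob.glob(DATA_DIR + "/train-*.tfrecord"))

# Standard file-based input pipeline; nothing here needs to know the data is remote and replicated.
dataset = (tf.data.TFRecordDataset(filenames)
           .shuffle(buffer_size=10000)
           .batch(128)
           .prefetch(1))
```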
Data Requirements
The distributed deep learning solution we propose has three layers. The bottom layer is the data layer, which is managed by a data service that enables you to create dedicated volumes for your training data and gives organizations the opportunity to put deep learning development, training and deployment closer to their data. The data layer should support enterprise features like security, snapshots and mirroring to keep your data secure and highly manageable in an enterprise setting.
The middle layer is the orchestration layer. We propose using Kubernetes to manage the GPU/CPU resources and to launch parameter servers and training workers for deep learning tasks as pods. The middle layer should also support heterogeneous clusters, where you can use CPU nodes to serve the model while using GPU nodes to train it. Advanced setups can mark nodes with different GPU cards so that you can launch lower-priority tasks on older GPU cards and higher-priority tasks on newer cards, as sketched below.
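As a sketch of that idea, a higher-priority training pod can be steered toward nodes with newer cards using a node selector. The gpu-generation label and its values are custom labels you would apply yourself (for example with kubectl label nodes), not a Kubernetes built-in, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is deployed on the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-trainer
spec:
  nodeSelector:
    gpu-generation: pascal        # custom node label, e.g. `kubectl label nodes gpu-node-1 gpu-generation=pascal`
  containers:
  - name: tf-worker
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1         # assumes the NVIDIA device plugin is deployed
```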
The top layer is the application layer. In this layer, deep learning tools such as TensorFlow are used to train models and put the resulting models into deployment. One effective design is to store the training data for a given distributed training job in containers that reside in the network neighborhood of the same compute nodes, to reduce network congestion. Container technology such as Docker, and orchestration technology such as Kubernetes, need to be supported so that deep learning tools like TensorFlow can be deployed in a distributed fashion. By the same principle, you can co-locate data with the CPU/GPU compute that consumes it. With an underlying data layer that scales horizontally, data can be continuously fed to the GPUs with high-speed streaming, preventing the compute resources from starving.
Deploying Applications
There are typically five steps to putting your deep learning application into production on the proposed distributed deep learning solution.
- Modify the TensorFlow application to add a distributed server. There are a number of ways to enable data parallelism in TensorFlow; synchronous training with between-graph replication is the more practical approach overall. In TensorFlow we can, for example, add a code snippet like:
```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["tf-ps0:2222", "tf-ps1:2222"],
                                "worker": ["tf-worker0:2222", "tf-worker1:2222"]})
server = tf.train.Server(cluster, job_name="ps", task_index=0)
```
Here the ps/worker hostnames, job_name and task_index can be passed in through the YAML file used to launch the Kubernetes pods. You can also put the code in the distributed data system and mount it into multiple pods when launching the Kubernetes job. A minimal flag-parsing sketch follows.
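Rather than hard-coding the addresses, the same information can be read from command-line flags that the pod spec supplies. The sketch below is illustrative only; the flag names (--ps_hosts, --worker_hosts, --job_name, --task_index) are our own choices, not mandated by TensorFlow or Kubernetes.

```python
import argparse
import tensorflow as tf

# Hypothetical flags; the Kubernetes pod spec passes these on the container command line.
parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", default="tf-ps0:2222,tf-ps1:2222")
parser.add_argument("--worker_hosts", default="tf-worker0:2222,tf-worker1:2222")
parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
parser.add_argument("--task_index", type=int, default=0)
args = parser.parse_args()

cluster = tf.train.ClusterSpec({"ps": args.ps_hosts.split(","),
                                "worker": args.worker_hosts.split(",")})
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # parameter servers block here and serve variables to the workers
```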
- Prepare the training data and load it onto the distributed data system. The best demonstrated practice is to create dedicated logical volumes for different deep learning applications so that the data can be managed more easily.
- Choose the container image to use. For example, we use the latest TensorFlow GPU image (such as tensorflow/tensorflow:latest-gpu on Docker Hub); these containers benefit from native access to a fully distributed data system.
- Write a YAML file to create the Kubernetes job. We want to mount the required NVIDIA libraries, the TensorFlow application code, a destination folder for persistence and checkpoints, and the training data. Here we can easily create a persistent volume backed by a distributed file system volume and grant multiple pods access through the persistent volume claim attached to this persistent volume; a minimal manifest sketch follows.
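The following is a minimal sketch of such a claim and job spec, not the exact manifest from this setup. The claim name tf-data-claim, the image tag, the mount paths and the flag values are illustrative assumptions, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed (older clusters exposed GPUs through a different alpha resource name).

```yaml
# Hypothetical PVC for the training-data volume on the distributed file system.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-data-claim
spec:
  accessModes:
  - ReadWriteMany                 # many worker pods read the same data and write checkpoints
  resources:
    requests:
      storage: 500Gi
---
# Hypothetical job for one worker task; ps tasks and other workers get analogous specs.
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-worker-0
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        command: ["python", "/data/code/trainer.py",
                  "--ps_hosts=tf-ps0:2222,tf-ps1:2222",
                  "--worker_hosts=tf-worker0:2222,tf-worker1:2222",
                  "--job_name=worker", "--task_index=0"]
        resources:
          limits:
            nvidia.com/gpu: 1     # requires the NVIDIA device plugin
        volumeMounts:
        - name: training-data     # code, training data and checkpoint directory
          mountPath: /data
        - name: nvidia-libs       # NVIDIA driver libraries from the host
          mountPath: /usr/local/nvidia
          readOnly: true
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: tf-data-claim
      - name: nvidia-libs
        hostPath:
          path: /usr/local/nvidia
```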
- Check that the results were persisted, and deploy the models further if the results look satisfactory. A small sketch of such a check follows.
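As a small sketch of that check (the checkpoint directory path is an assumption for illustration), you can verify from any node that checkpoints were written to the shared volume before promoting the model:

```python
import tensorflow as tf

# Assumed checkpoint directory on the shared, POSIX-mounted volume.
CHECKPOINT_DIR = "/mnt/distributed-fs/projects/mymodel/checkpoints"

# Returns the path of the most recent checkpoint, or None if nothing was persisted.
latest = tf.train.latest_checkpoint(CHECKPOINT_DIR)
if latest is None:
    raise RuntimeError("No checkpoint found; training may not have persisted results yet.")
print("Most recent checkpoint:", latest)
```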
Additional Challenges
Part of the challenge of real-time deep learning training and inference lies in the fast, reliable distribution of the data. With the advancement of broadcasting technologies, video quality will continue to increase, which in turn pushes deep learning technology to catch up. Techniques like compressed sensing could potentially be leveraged to reduce the load on the streaming end, and deep learning models could be used to reconstruct features from a low-dimensional space, making compressed sensing feasible for real-time video applications.
Also, beyond distributed training, Kubernetes is more suitable than YARN and Mesos for hosting online applications.
Feature image via Pixabay.