While “Big Data” tools such as Spark and MapReduce may offer a resilient way to spread a job out across multiple nodes in such a way that the work can tolerate the failure of a few nodes, some deep learning jobs require that each node stay running until the end of the job, car-sharing service Uber has found.
To work with this requirement, Uber has turned to gang scheduling, an optimization algorithm long-known in the field of supercomputing. Gang scheduling ensures that a cluster computing job will run only if all the nodes can be run at the same time, explained Min Cai, Uber staff engineer, during a presentation at MesosCon in Los Angeles last week.
Cai was one of the Uber engineers who implemented the gang scheduling algo in an open source framework, called Horovod, for running Google’s TensorFlow machine learning software across multiple nodes.
Uber uses the software to run training models for deep learning tasks running hundreds of GPUs, for research into guidance for self-driving cars, image classification, and fraud detection. This sort of “deep learning” training involves much feedback among nodes, so it is essential that all nodes are operational at the same time. Mesos containers were used because Docker does not yet support GPUs in upstream releases.
Identical copies of each deep learning program are installed in multiple containers, so they can be run in tandem on top of Mesos clusters. Uber uses a nested-container approach to avoid dependency conflicts, in which the program itself is placed in one Docker container, which is encapsulated by a second container with management code. With gang scheduling, Horovod places all the containers on a single host.
Uber also looked at other TensorFlow packages for running jobs across multiple nodes, but they were rejected given they imposed a learning curve on developers; it is much easier to learn the restraints of the message-passing interface (MPI), a communications library found in most all supercomputers, and one Uber uses in Horovod.
When using Horovod, the developer includes a library call within the program, and, come run-time, a software agent will launch the required number of copies to run the application, noted Alex Sergeev, Uber senior engineer, during the same presentation.
Facebook has also deployed a similar task, though uses the Caffe framework instead of TensorFlow, Sergeev noted.
Mesosphere, which manages the Mesos project, is a sponsor of The New Stack.
Feature image, Min Cai (right) and Alex Sergeev at MesosCon.