Engineers from the Facebook AI Research (FAIR) team took to the stage at the recent NVIDIA GPU Technology Conference in San Jose to explain the technology infrastructure they use for their work. Howard Mansell, engineering manager, and Soumith Chintala, AI research engineer, packed the session with deep geek knowledge.
“Our mission is to advance the field of machine learning or machine intelligence,” said Mansell. The team engages in research, publishes, and participates in conferences and workshops, all while focusing on the longer-term academic problems surrounding artificial intelligence (AI). They are active in researching computer vision, natural language understanding, speech, machine translation, unsupervised learning, and reinforcement learning. “Pretty much everything,” he said. “We believe this approach will progress AI faster.”
Most of their research centers on training neural networks, but they do collaborate with the applied machine learning team to productize the research. Two examples are a new translation product built on deep learning, and image classification, which describes what’s in an image you post to Facebook.
Their researchers need the best and most flexible hardware and software so they can move fast, said Mansell. Projects change on a daily basis so flexibility is key. And it’s important to have a really well-managed system.
As you can imagine, all of this takes an enormous amount of computing power.
Training neural networks is as much an engineering problem as it is a mathematical one, he said. Mansell’s team works on a scale most of us can’t even imagine. The model being trained comprises all of the network’s weights and biases, which adds up to tens of millions of parameters. They fit those parameters to a large dataset, which can run to tens of terabytes. So they’re computing in what he called “a million dimensional” space. The mind boggles.
Mansell laid out the basic requirements for deep neural net (DNN) training:
- Create a model with millions of parameters.
- Train on multi-TB datasets.
- A computer vision model requires 5-100+ exaFLOPs (10^18 floating-point operations) of compute, along with billions of I/O operations (IOPs).
- Remember that gradient descent algorithms are sequential.
The algorithms are inherently sequential, he said, so “you can’t just throw tons of CPUs at it and hope it will scale. We have to think carefully about the interconnect and where the bottlenecks are.”
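To see why the updates are sequential, consider a minimal sketch of gradient descent on a toy one-parameter loss. The loss function, learning rate, and function names here are illustrative assumptions, not anything shown in the talk. Each step needs the weight produced by the previous step, so the steps themselves cannot run side by side; only the work inside a single step can be spread across devices.

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, lr=0.1, steps=50):
    """Run plain gradient descent and return every intermediate weight."""
    history = [w0]
    w = w0
    for _ in range(steps):
        # Step t+1 reads the weight written by step t: a strict dependency chain.
        w = w - lr * grad(w)
        history.append(w)
    return history


weights = gradient_descent(w0=0.0)
print(len(weights))  # 51 (the starting weight plus 50 updates)
```

This is why the interconnect matters: parallelism has to come from splitting each step (for example, across GPUs that then exchange gradients), not from running many steps at once.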
It helps that they are building on Facebook’s massive-scale compute resources and the highly scaled infrastructure that supports the Facebook, Instagram, and WhatsApp products.
Aside from requiring extreme performance, they also need extreme flexibility. Because their research projects change daily, there’s no time to optimize infrastructure for a particular product or project.
The main component of the FAIR hardware cluster is the DGX-1.
The DGX-1 cube mesh is two interwoven rings with bidirectional, 16-gigabyte-per-second bandwidth. Mansell said this achieves over six gigabytes per second of all-reduce throughput for a single system.
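An all-reduce sums a buffer (typically gradients) across every GPU and leaves the full result on each one. A ring layout suits it because each device only ever talks to its neighbor. The following is a rough pure-Python simulation of the well-known ring all-reduce pattern, assuming one chunk per worker; the function name is hypothetical, and this is not the code FAIR or any GPU library actually runs.

```python
def ring_all_reduce(buffers):
    """Sum equal-length lists across workers arranged in a ring.

    buffers[i] is worker i's data, split into one chunk (here, one number)
    per worker. Returns the buffers after every worker holds the full sum.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]  # copy so the caller's input is untouched

    # Phase 1, reduce-scatter: after n-1 steps worker i owns the complete
    # sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing values first, since all sends happen "at once".
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


# Four simulated workers, each holding a four-element "gradient".
result = ring_all_reduce([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
    [1000, 2000, 3000, 4000],
])
print(result[0])  # [1111, 2222, 3333, 4444]
```

Each worker sends and receives only one chunk per step, so the per-link traffic stays constant as more workers join the ring.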
The most important component there is the eight Pascal P100 GPUs, he said. Those are interconnected with NVLink, giving each of them four 16-gigabyte-per-second bidirectional links. The stack also includes a couple of powerful CPUs and a good PCI Express topology. This allows them to feed the data into the GPUs as fast as needed to keep up with the increasing power of GPUs.
To get the lowest possible latency with the highest possible bandwidth, Mansell said, GPU-to-GPU traffic bypasses PCI Express by going over NVLink. This means data transfers between GPUs can go through fast bidirectional links, allowing them to optimize the PCI Express topology for feeding data to the GPUs.
They also maintain one InfiniBand interface card for each pair of GPUs. Each card has 100 gigabits per second of bandwidth, he explained. They also use RDMA to transfer data from a GPU in one machine to a GPU in another without involving the CPU.
The compute platform used in production has been open-sourced, he said. It has the same board containing the GPUs, in the same cube mesh architecture.
The DGX-1 is a great system for deep learning, he said. “We built [the cluster] out of 128 DGX-1s. That gives us ten-and-a-half petaFLOPS total for 32-bit floating-point compute, and twice that for 16-bit.”
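As a quick sanity check on those figures, the arithmetic can be worked through in a few lines. The inputs are the numbers quoted in the talk; the per-GPU figure is simply derived from them and is not something Mansell stated.

```python
NUM_SYSTEMS = 128            # DGX-1 systems in the cluster (from the talk)
GPUS_PER_SYSTEM = 8          # P100 GPUs per DGX-1 (from the talk)
CLUSTER_FP32_PFLOPS = 10.5   # "ten-and-a-half petaFLOPS" (from the talk)

total_gpus = NUM_SYSTEMS * GPUS_PER_SYSTEM

# 1 petaFLOPS = 1,000 teraFLOPS
fp32_tflops_per_gpu = CLUSTER_FP32_PFLOPS * 1000 / total_gpus

# "Twice that" for 16-bit
cluster_fp16_pflops = 2 * CLUSTER_FP32_PFLOPS

print(total_gpus)                      # 1024
print(round(fp32_tflops_per_gpu, 2))   # 10.25
print(cluster_fp16_pflops)             # 21.0
```

The implied figure of roughly 10 teraFLOPS of 32-bit compute per GPU is a peak number, so sustained training throughput would be lower.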
Because they use multiple DGX-1s for training a single model, he said, they built out a non-blocking InfiniBand (IB) network fabric. Nothing blocks, which means no bottlenecks, he explained. “I can take any arbitrary pair of DGX-1s and communicate between them at 400 gigabits per second. All pairs of DGX-1s can be doing that simultaneously. They’re not going to bottleneck at the network.”
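The 400-gigabit figure lines up with the hardware described earlier: one 100-gigabit network card per pair of GPUs, and eight GPUs per system. A short worked calculation, using only numbers from the talk (the variable names are illustrative):

```python
GPUS_PER_SYSTEM = 8     # GPUs in one DGX-1
GPUS_PER_CARD = 2       # one network card per pair of GPUs
GBITS_PER_CARD = 100    # gigabits per second per card

cards_per_system = GPUS_PER_SYSTEM // GPUS_PER_CARD
system_gbits = cards_per_system * GBITS_PER_CARD
system_gbytes = system_gbits / 8  # 8 bits per byte

print(cards_per_system)  # 4
print(system_gbits)      # 400
print(system_gbytes)     # 50.0
```

In other words, each system can move about 50 gigabytes per second to any other system, provided the fabric itself never becomes the limit.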
Two other important components of the cluster are development servers and shared storage. Researchers work from laptops, Mansell said, so for development servers they built a mini DGX-1 using dual Quadro GP100 GPUs with NVLink between them.
Storage is often neglected in clusters, he said. “You basically need to serve the data from SSD or RAM, because there’s a random access pattern and high-throughput requirements, and a Pascal GPU can consume hundreds of images per second.” This would peak at a thousand images per second for quite a simple model. “If you scale that up to 1,024 GPUs, obviously you have a lot of file I/O going on,” he explained.
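To put a number on that, here is a back-of-the-envelope estimate of the peak read rate. It assumes one file read per image, which is an assumption made here for illustration and not something stated in the talk.

```python
NUM_GPUS = 1024               # GPU count quoted in the talk
PEAK_IMAGES_PER_GPU = 1000    # peak images per second for a simple model

# Assumption: every image consumed costs exactly one file read.
peak_reads_per_second = NUM_GPUS * PEAK_IMAGES_PER_GPU

print(peak_reads_per_second)  # 1024000
```

Roughly a million random reads per second is well beyond what spinning disks can serve, which is consistent with the advice to serve data from SSD or RAM.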
Because of the random access pattern over a large dataset, caching is going to be ineffective. Once you get off a local disk, you can run into problems, he said. They built a storage system that meets the IOPS requirements for the entire cluster without depending on caching. The system supports roughly 150,000 file reads per second per 100 terabytes of storage, he said, and they can scale the IOPS at the same time as storage. “We’re going to shard the datasets between the different ranks so that we don’t get hotspots even if we use relatively small amounts of data,” Mansell said.
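The talk does not say how the sharding is done. One common way to spread files evenly, shown here purely as an illustration, is to place each file by a stable hash of its path. The file names, shard count, and `shard_for` helper are all hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(path, num_shards):
    """Map a file path to a shard index using a stable hash."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical dataset: 10,000 image files spread over 8 storage shards.
paths = [f"dataset/img_{i:05d}.jpg" for i in range(10_000)]
load = Counter(shard_for(p, num_shards=8) for p in paths)

for shard in sorted(load):
    print(shard, load[shard])  # each shard ends up with roughly 1,250 files
```

Because placement depends only on the path, even a small dataset is scattered across all shards, so random reads do not pile up on a single storage node.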
And Then There’s Software
The FAIR team uses two frameworks, both open sourced and maintained by Facebook: PyTorch and Caffe2. Each is tuned for different purposes.
PyTorch is designed for research applications. It has a new front-end in Python, and it uses an approach called define-by-run. “What this means is that you write simple imperative Python code to perform the forward computation of your network, and as a side effect that constructs a dynamic graph of the computation, and we can use that to compute the gradients in the backward pass,” he said.
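The idea of recording a graph as a side effect of running ordinary code can be illustrated with a toy scalar class. This is a minimal sketch written for this article, assuming only addition and multiplication; the `Value` class is hypothetical and is not how the framework is implemented internally.

```python
class Value:
    """A scalar that records how it was computed (hypothetical toy class)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents    # inputs this value was computed from
        self.grad_fns = grad_fns  # maps upstream gradient to each parent's share

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        """Walk the recorded graph in reverse topological order."""
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, fn in zip(v.parents, v.grad_fns):
                parent.grad += fn(v.grad)


x, w, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b   # ordinary Python; the graph is built as a side effect
y.backward()    # the backward pass replays that graph in reverse

print(y.data, w.grad, x.grad, b.grad)  # 7.0 2.0 3.0 1.0
```

Because the graph is rebuilt on every run, ordinary Python control flow such as loops and conditionals can change the network's shape from one input to the next.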
Caffe2, in contrast, is tuned more for production workflows. Caffe2 works very efficiently for inference on a low-end Android device, Mansell said. “If you go to your Facebook app and go to camera, there’s a new style transfer in there that’s powered by Caffe2 and will run on [mobile] devices and use mobile GPUs in some cases when they’re available.”
Caffe2 works as define-then-execute, he said. “So you statically define your compute graph and you don’t have to have a Python interpreter in your production code, which is important for [mobile] applications, for example, or super scalable low-latency systems like ads.”
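Defining the graph up front and executing it separately can be sketched as follows. The graph format, operator table, and `run` function are hypothetical and do not reflect the framework's real API. The executor is written in Python only to keep the example short; the point is that the graph itself is plain data that a native runtime could execute.

```python
# Step 1: define the graph once, as plain data. No Python callables are
# stored in it, so a description like this could be serialized and shipped.
GRAPH = [
    # (output name, operator, input names)
    ("wx", "mul", ("w", "x")),
    ("y", "add", ("wx", "b")),
]

# Operator table for this toy executor (hypothetical).
OPS = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
}


def run(graph, inputs):
    """Step 2: execute the fixed graph on a set of named inputs."""
    values = dict(inputs)
    for out, op, args in graph:
        values[out] = OPS[op](*(values[a] for a in args))
    return values


# The same graph is reused unchanged for every request.
out1 = run(GRAPH, {"w": 3.0, "x": 2.0, "b": 1.0})["y"]
out2 = run(GRAPH, {"w": 3.0, "x": 5.0, "b": 1.0})["y"]
print(out1, out2)  # 7.0 16.0
```

Fixing the graph ahead of time gives up some flexibility, but it lets the runtime be small and predictable, which matters on phones and in latency-sensitive services.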
It’s important to have a really well-managed system, Mansell summarized. “The NVLink cube mesh on the DGX-1 enables fast communication [within a single system]. Ring-based collectives are typically the best way to utilize those things, and fast storage is really important.”