The Ins and Outs of Deep Learning with Apache Spark
Databricks was founded in 2013 to help people build big data platforms using the Apache Spark data processing framework. It provides Spark as a platform, along with expertise in deep learning on GPUs, which can greatly speed up deep learning jobs.
There are multiple ways to integrate Spark and deep learning, but there is currently no consensus on how to best use Spark for deep learning, Hunter said. That said, there are general rules across any installation process that can help developers who are new to applying deep learning to their projects.
Deep Learning Is Different from Machine Learning
First, though, it’s important to understand that deep learning has different requirements from machine learning (ML), he said. With machine learning, workloads are dominated by communication and a lot of Input/Output (I/O). Therefore, it’s practical to have machines with a lot of memory to facilitate caching operations. This reduces the amount of communication and thus increases speed.
But deep learning is typically done on a much smaller cluster, so it needs more computing power and less memory. This discrepancy can create problems for developers who don’t understand the difference, Hunter said. A deep learning system uses artificial neural networks, loosely modeled on the human brain, to find patterns in large sets of data. It is especially useful for tasks such as computer vision.
But the most important thing to realize when using deep learning at scale, he said, is that most insights are specific to a task, a dataset and an algorithm. And nothing replaces experiments to see what works and what doesn’t work.
He suggested getting started with data-parallel jobs on one machine and moving to cooperative frameworks only when your data sets get too large. Multiple GPUs are harder to set up, Hunter explained, and are harder to train and harder to debug. So it’s best to start out small.
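As a rough illustration of what a data-parallel job on one machine looks like, the sketch below splits a dataset into partitions, applies the same function to each partition independently, and merges the results. It uses only the Python standard library and does not call Spark. The function names and the squaring "model" are hypothetical stand-ins, and threads are used only to keep the example simple; a real job would more likely use separate processes or a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    size, remainder = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra record.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def score_partition(partition):
    """Hypothetical stand-in for per-partition work, such as model inference."""
    return [x * x for x in partition]


def run_data_parallel(records, num_partitions=4):
    """Apply score_partition to each partition independently, then merge."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # pool.map returns results in the same order as the input partitions.
        results = list(pool.map(score_partition, partitions))
    return [y for part in results for y in part]
```

No partition needs to talk to any other partition while it runs, which is what makes this style easier to set up and debug than a cooperative framework.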
Last year saw the emergence of solutions to combine Spark and deep learning. They all take different approaches.
The most popular deep learning frameworks with Spark bindings are Caffe (CaffeOnSpark), Keras (Elephas), MXNet, PaddlePaddle, and TensorFlow. Native to Spark are BigDL, DeepDist, DeepLearning4J, MLlib, SparkCL, and SparkNet.
So the next obvious question is “which solution should we use?”
There’s no right answer, said Hunter, noting that all of these solutions use different programming languages and libraries. Databricks hosts on the public cloud and prefers using GPUs for deep learning workloads, but its customers use a wide variety of deep learning frameworks.
According to Hunter, the big question when considering deep learning is, “Do you want to be constrained by I/O or processors?” And the answer to that is determined by the size of your dataset.
If your dataset is small, say 60K images, and the images themselves are small, you can use Amazon’s MXNet. This allows you to load the images into the memory of each worker and process them in parallel.
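A quick back-of-the-envelope check can tell you whether a dataset is small enough to load into each worker's memory. The sketch below is only an estimate of raw pixel storage. The 28x28 grayscale image size, the 8 GiB worker and the 50% usable-memory fraction are illustrative assumptions, not figures from the talk.

```python
def dataset_bytes(num_images, height, width, channels=1, bytes_per_value=1):
    """Raw size of an uncompressed image dataset, in bytes."""
    return num_images * height * width * channels * bytes_per_value


def fits_in_worker_memory(total_bytes, worker_memory_bytes, usable_fraction=0.5):
    """True if the dataset fits in the share of memory we allow it to use.

    usable_fraction is a hypothetical safety margin that leaves room for
    the model, the framework and the operating system.
    """
    return total_bytes <= worker_memory_bytes * usable_fraction


# Assumed example: 60,000 grayscale 28x28 images at one byte per pixel.
small = dataset_bytes(60_000, 28, 28)
# Assumed example: the same number of 1024x1024 RGB images.
large = dataset_bytes(60_000, 1024, 1024, channels=3)

worker_memory = 8 * 1024**3  # hypothetical 8 GiB worker
print(small, fits_in_worker_memory(small, worker_memory))
print(large, fits_in_worker_memory(large, worker_memory))
```

Under these assumptions the small dataset is about 47 MB and fits easily on every worker, while the larger one is far beyond what a single worker can hold and pushes you toward an I/O-bound design.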
Data Pipelines and Deep Learning
In addition to considering performance, developers need to figure out how deep learning integrates into the overall data pipeline. In general, deep learning is really only about the data verification piece of the data pipeline, said Hunter. But it’s critical to consider everything that happens around deep learning when setting it up.
There are two ways many people use deep learning, Hunter said. The first is for training, where workloads are dominated by I/O. This requires a large cluster with a high memory-to-CPU ratio. When the training is complete and the system moves into the deep learning phase, a shift is needed to a compute-intensive, small cluster with a low memory-to-CPU ratio.
The second use of deep learning is specialized data transformations, such as feature extraction and prediction. For example, the input might be pictures of puppies and kittens, and the output the same pictures with labels. The deep learning transformation process is constrained by how much communication you need in order to move your data from where it’s stored to the places where computations take place, said Hunter.
At Databricks, the most common application is Spark being used as a scheduler, said Hunter. “Spark processes a lot of tasks that are running in parallel and that can maintain data individually,” he said, but he noted that with data-parallel tasks, the data is stored outside Spark.
For embedded deep learning transformations, tasks are also run in data-parallel, but the data are most commonly stored in DataFrames or resilient distributed datasets (RDDs) inside Spark, he said.
There is also the option of using cooperative frameworks that can bypass Spark in a distributed manner, Hunter said. This is characterized by multiple passes over data and heavy and/or specialized communication. He warned that with cooperative frameworks, Spark doesn’t see what the system is doing, but gets the results. So monitoring the process and troubleshooting has to be done outside of Spark.
Streaming Data through Deep Learning
The most important part of setting up deep learning is how you get the data into your deep learning system, Hunter said. You have three primary storage choices.
The first is a “cold layer,” using the Amazon Simple Storage Service (S3) or an in-house HDFS. The second choice is local storage with Spark’s on-disk persistence layer. The third, and fastest, choice is in-memory storage using Spark RDDs or DataFrames. When data doesn’t fit in local memory, it can be stored on the local disk. “People forget how fast it is to retrieve data from a local disk,” Hunter pointed out.
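The choice among the three storage layers can be read as a simple rule: use the fastest layer the data fits into. The sketch below encodes that rule in plain Python. The function name, tier labels and byte figures are hypothetical, and the code does not call Spark; Spark's own persistence settings are what you would configure in practice.

```python
def choose_storage_tier(data_bytes, free_memory_bytes, free_local_disk_bytes):
    """Pick the fastest storage tier that can hold the dataset.

    Tiers, fastest first: "memory" (in-memory RDDs or DataFrames),
    "local_disk" (on-disk persistence on the worker), and
    "cold" (remote storage such as S3 or HDFS).
    """
    if data_bytes <= free_memory_bytes:
        return "memory"
    if data_bytes <= free_local_disk_bytes:
        return "local_disk"
    return "cold"


GIB = 1024**3
# Hypothetical worker with 16 GiB of free memory and 500 GiB of free local disk.
for size in (4 * GIB, 100 * GIB, 2000 * GIB):
    print(size // GIB, "GiB ->", choose_storage_tier(size, 16 * GIB, 500 * GIB))
```

A real decision would also weigh how often the data is re-read and how expensive it is to recompute, which this sketch ignores.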
There are several things to look out for when adding deep learning, according to Hunter.
Most of the deep learning frameworks are built with Python in mind, so they usually have a kernel written in Java or C++ but expose a Python interface, he said.
One common approach is to use the PySpark library, though here deep learning ends up with a lot of communication bottlenecks. All frameworks are heavily optimized for disk I/O. Reading files is fast, and the frameworks fall back to local files when the data does not fit in memory.
But there is a downside to this, said Hunter. One of the advantages of using Spark is that when a machine goes down and its data is lost, you can re-run the computation on another machine and get the same result. You lose this ability when data is stored in local memory.
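Re-running a computation on another machine only gives the same result if the work on each partition is deterministic. One common way to get that property, sketched below in plain Python without Spark, is to derive any randomness from a fixed seed plus the partition index instead of from global state. The function name, seed scheme and noise step are hypothetical.

```python
import random


def process_partition(partition_index, partition, base_seed=42):
    """Add random noise to each record in a reproducible way.

    The generator is seeded from base_seed and partition_index only, so
    recomputing the same partition anywhere yields identical output.
    """
    rng = random.Random(base_seed * 1_000_003 + partition_index)
    return [x + rng.random() for x in partition]


first_run = process_partition(2, [1.0, 2.0, 3.0])
# Simulate losing the result and recomputing the partition elsewhere.
recomputed = process_partition(2, [1.0, 2.0, 3.0])
print(first_run == recomputed)
```

State that lives only in a worker's local memory, such as partially trained weights, has no such recipe for rebuilding it, which is the tradeoff described above.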
So it’s a tradeoff. Quoting machine learning expert Léon Bottou, he said, “Reproducibility is worth a factor of two,” meaning he would be willing to accept results that were not reproducible only if they were delivered at least twice as fast. “I would need some machine that is running 10x faster in order to lose the computation that I cannot see.” So it depends on how much the results matter to you. This is an area that is under exploration, said Hunter.
So for deep learning developers, important questions are: what boundary do you want to put between the algorithm and Spark? Do you want the algorithm to use Spark’s communication, knowing the tradeoffs?
“The software stack goes deep and needs very careful construction,” said Hunter. If you miss one step, you will have bad performance when you start running the system.
Turnkey stacks are starting to appear, but for now, developers have to integrate hardware, drivers, libraries and any special calls to the GPUs, he said. You will need to support multiple versions of all of these.
If you’re working on a public cloud, you need to also support a lot of different hardware along with the different software versions.
Databricks found that some clients would go outside its recommendations and install their own GPU libraries, then complain that the GPU was slow. The system was breaking because the libraries had not been properly integrated.
The most straightforward solution, said Hunter, is to package all the dependencies in a way that depends as little as possible on the underlying operating system. He suggested that Docker is very convenient for that purpose.