The Good, Bad and Ugly: Apache Spark for Data Science Work
Apache Spark is an in-memory data analytics engine. It is wildly popular with data scientists because of its speed, scalability and ease-of-use. Plus, it happens to be an ideal workload to run on Kubernetes.
Many Pivotal customers want to use Spark as part of their modern architecture, so we wanted to share our experiences working with the tool. This post kicks off a series in which we will relate our experiences from over a year using multiple versions of Spark on a variety of infrastructures, including on-premise, Google Cloud Computing (GCC) and Amazon Web Services (AWS). We have so far used Spark for data cleansing and transformation, feature engineering, model building, model evaluation scoring and productionalizing data science pipelines.
At Pivotal, our data science team is continuously testing the latest machine learning tools and technologies. We want tools that make it easy for us to focus on the data exploration and modeling aspects of our job and minimize data engineering, data preparation and other tedium. These goals are what have made tools like R and the PyData ecosystem, such as Pandas, so popular. Their abstractions, such as data frames, make interactive analysis easy and pleasurable. They also facilitate parallel computation on a single machine, helping us learn and iterate faster. However, as soon as the data set exceeds the capacity of a single machine, R and Pandas can no longer meet these needs.
Spark is an open-source distributed computing framework that promises a clean and pleasurable experience similar to that of Pandas, while scaling to large data sets via a distributed architecture under the hood. In many respects, Spark delivers on its promise of easy-to-use, high-performance analysis on large datasets. However, Spark is not without its quirks and idiosyncrasies that occasionally add complexity.
As with any tool, there are features of Spark we like, features we don’t like and features whose design we simply cannot comprehend. Here is a preview of the good, the bad and the ugly. We will add more detail in future posts. Along the way, we will share tips and tricks for making the most of Apache Spark.
It’s easy to see why Apache Spark is so popular. It does in-memory, distributed and iterative computation, which is particularly useful when working with machine learning algorithms. Other tools might require writing intermediate results to disk and reading them back into memory, which can make using iterative algorithms painfully slow. But that’s not the only reason to like Spark.
Appealing APIs and Lazy Execution
Spark’s API is truly appealing. Users can choose from multiple languages: Python, R, Scala and Java. Spark offers a data frame abstraction with object-oriented methods for transformations, joins, filters and more. This object orientation makes it easy to create custom reusable code that is also testable with mature testing frameworks.
“Lazy execution” is especially helpful as it allows you to define a complex series of transformations represented as an object. Further, you can inspect the structure of the end result without even executing the individual intermediate steps. And Spark checks for errors in the execution plan before submitting so that bad code fails fast.
PySpark offers a “toPandas()” method to seamlessly convert Spark DataFrames to Pandas, and its “SparkSession.createDataFrame()” can do the reverse. The “toPandas()” method allows you to work in-memory once Spark has crunched the data into smaller datasets. When combined with Pandas’ plotting method, you can chain together commands to join your large datasets, filter, aggregate and plot all in one command. Python is a language that enables rapid operationalization of data and the PySpark package extends this functionality to massive datasets.
Pivoting is a challenge for many big data frameworks. In SQL, it typically requires many case statements. Spark has an easy and intuitive way of pivoting a DataFrame. The user simply performs a “groupBy” on the target index columns, a pivot of the target field to use as columns and finally an aggregation step. After using it extensively for the past year, we find that it executes surprisingly fast and is also easy to use.
Another asset of Spark is the “map-side join” broadcast method. This method speeds up joins significantly when one of the tables is smaller than the other and can fit in its entirety on individual machines. The smaller one gets sent to all nodes so the data from the bigger table doesn’t need to be moved around. This also helps mitigate problems from skew. If the big table has a lot of skew on the join keys, it will try to send a large amount of data from the big table to a small number of nodes to perform the join and overwhelm those nodes.
Open Source Community
Spark has a massive open-source community behind it. The community improves the core software and contributes practical add-on packages. For example, a team has developed a natural language processing library for Spark. Previously, a user would either have to use other software or rely on slow user-defined functions to leverage Python packages such as Natural Language Toolkit.
No tool is perfect, so let’s review the challenges you may face with Spark.
Spark is notoriously difficult to tune and maintain. That means ensuring top performance so that it doesn’t buckle under heavy data science workloads is challenging. If your cluster isn’t expertly managed, this can negate “the Good” as we described above. Jobs failing with out-of-memory errors is very common and having many concurrent users makes resource management even more challenging.
Do you go with fixed or dynamic memory allocation? How many of your cluster’s cores do you allow Spark to use? How much memory does each executor get? How many partitions should Spark use when it shuffles data? Getting all these settings right for data science workloads is difficult.
Debugging Spark can be frustrating. The client-side type checking for “DataFrame” operations in PySpark can catch some bugs (like trying to do operations on fields with incompatible types). But memory errors and errors occurring within user-defined functions can be difficult to track down.
Distributed systems are inherently complex, and so it goes for Spark. Error messages can be misleading or suppressed, logging from a PySpark User Defined Function (UDF) is difficult and introspection into current processes is not feasible. Creating tests for your UDFs that run locally helps, but sometimes a function that passes local tests fails when running on the cluster. Figuring out the cause in those cases is challenging.
Slowness of PySpark UDFs
PySpark UDFs are much slower and more memory-intensive than Scala and Java UDFs are. The performance skew towards Scala and Java is understandable, since Spark is written in Scala and runs on the Java Virtual Machine (JVM). Python UDFs require moving data from the executor’s JVM to a Python interpreter, which is slow. If Python UDF performance is problematic, Spark does enable a user to create Scala UDFs, which can be run in Python. However, this slows down development time.
Hard-to-Guarantee Maximal Parallelism
One of Spark’s key value propositions is distributed computation, yet it can be difficult to ensure Spark parallelizes computations as much as possible. Spark tries to elastically scale how many executors a job uses based on the job’s needs, but it often fails to scale up on its own. So if you set the minimum number of executors too low, your job may not utilize more executors when it needs them. Also, Spark divides RDDs (Resilient Distributed Dataset)/DataFrames into partitions, which is the smallest unit of work that an executor takes on. If you set too few partitions, then there may not be enough chunks of work for all the executors to work on. Also, fewer partitions means larger partitions, which can cause executors to run out of memory.
The ugly aspects of Spark tend to fall into two categories: aspects of the API that are awkward or don’t make sense and a lack of maturity and feature completeness of the Apache Spark project.
Since much of the Spark API is so elegant, the inelegant parts really stand out. For example, we consider accessing array elements to be an ugly part of Spark life. While there is nothing inherently wrong with how this is implemented in Spark, it is counterintuitive given how seamlessly the DataFrame API performs in other areas. We often find ourselves wanting to store the results of a model in a Spark DataFrame. When these results include arrays of values, accessing the elements of the array is anything but straightforward. This is inherently unavoidable, as many Spark-ML functions return arrays.
Lack of Maturity and Feature Completeness
Spark has come a long way since its University of Berkeley origins in 2009 and its Apache top-level debut in 2014. But despite its vertiginous rise, Spark is still maturing and lacks some important enterprise-grade features.
To its contributors’ credit, over the last couple of years, Apache Spark has gained new DataFrames and DataSets abstractions, more algorithms in its machine learning libraries (MLlib and ML) and improved performance. Unfortunately, enterprises can be slow to upgrade to newer versions. As data scientists, we have had to cope with older versions, such as 1.6, 2.0 and 2.1, which lack important features. Keep in mind these versions are less than two years old.
Spark’s machine learning library lacks some basic features. For example, Random Forest did not have feature importance in its new ML library until Spark 2.0 (released July 2016). Gradient Boosted Trees did not expose a probability score until Spark 2.2 (released July 2017). This renders it unusable for most use cases. Even for models exposing a floating point score, an ArrayType is returned, such as [0.25, 0.75]. Shockingly, there is no built-in function to extract that 0.75. It requires a UDF, which, as described above, is slow in Python and should be used sparingly. As a result, we find ourselves falling back to training models locally using the more mature scikit-learn library.
Another example feature gap is difficulty creating sequential unique record identifiers with Spark. A sequential, unique index column is helpful for some types of analysis. According to the documentation, “monotonically_increasing_id()” generates a unique ID for each row, but does not guarantee that the IDs are consecutive. If consecutive IDs are important to you, then you may need to use Spark’s older RDD format.
In light of the good, the bad and the ugly, Spark is an attractive tool when viewed from the outside. Be aware of the gotchas before going all-in. Stay tuned for follow-up posts in this series that detail how you can make the most of Apache Spark for your data science workloads.