
Optimizing Compute in the Post-Hadoop Era

25 Aug 2021 11:00am, by Randy Thomasson

In my recent CIO.com post, “Is There Life After Hadoop?”, I wrote about the post-Hadoop era and two key strategies that organizations can deploy to help them transition. These strategies are: 1) Build a better lake, and 2) Optimize the compute.

I’ll expand on building a better lake in a future article, but today I want to focus on the compute part of the equation. As I wrote in my previous article, Apache Spark’s flexibility, columnar approach to data, suitability for artificial intelligence (AI) and machine learning, and its vastly improved performance over Hadoop have all served to dramatically increase its adoption in recent years. For most users, it has become the logical successor to Hadoop MapReduce. This article addresses how to get the most from Spark and help ignite a revolution.

Why Spark?

Randy Thomasson
As a global solution architect for HPE Ezmeral software, Randy provides technical leadership, strategy and architectural guidance spanning a wide range of technologies and disciplines, including application development and modernization, big data and advanced analytics, infrastructure automation, in-memory and NoSQL data technologies and DevOps.

Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data, using a state-of-the-art DAG (directed acyclic graph) scheduler, a query optimizer and a physical execution engine. It has a strong affinity with Hadoop and uses many Hadoop libraries. Spark also powers a stack of its own libraries, including Spark SQL and DataFrames, Spark Streaming, MLlib and GraphX. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the application’s main program (the driver).
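As a rough illustration of why a DAG is useful for scheduling, the sketch below orders a handful of dependent stages so that each one runs only after its inputs are ready. This is a toy model built on Python’s standard `graphlib` module (Python 3.9 or later), not Spark’s actual scheduler, and the stage names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each hypothetical stage maps to the set of stages it depends on.
stages = {
    "read_orders": set(),
    "read_customers": set(),
    "join": {"read_orders", "read_customers"},
    "aggregate": {"join"},
}


def execution_order(dag):
    """Return one valid order in which the stages could run."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    print(execution_order(stages))
```

The two read stages have no dependencies on each other, so a scheduler is free to run them in parallel, while the join has to wait for both.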

Spark is well suited to the needs of today’s data-driven business, as its support for streaming and in-memory processing can yield significant performance improvements over more batch-oriented technologies like Hadoop.

Spark’s cluster-based architecture also makes it well suited to handle a wide range of data sets. Moreover, Spark provides multiple deployment options and directly supports four different cluster managers:

  • Standalone – a basic cluster manager included with Spark for simple, easy-to-run clusters.
  • Apache Mesos – an open source cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN – an open source resource manager that is included in Hadoop 2 and later.
  • Kubernetes – an open source system for automating deployment, scaling, and management of containerized applications.

The good news is that the underlying cluster manager is transparent to Spark applications. So choosing a different cluster manager doesn’t require changes to Spark applications, only the deployment configuration.
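To make this concrete, here is a minimal Python sketch of how only the deployment side changes from one cluster manager to another. The master URL schemes follow the formats Spark documents for `spark-submit`; the host names, ports, the `build_submit_command` helper and the `etl_job.py` application are hypothetical placeholders, and a real deployment would typically need additional settings.

```python
# Sketch: the same application submitted to four cluster managers.
# Host names, ports and the application file are hypothetical placeholders.

MASTER_URLS = {
    "standalone": "spark://spark-master.example.com:7077",
    "mesos": "mesos://mesos-master.example.com:5050",
    "yarn": "yarn",  # cluster location comes from the Hadoop configuration
    "kubernetes": "k8s://https://k8s-api.example.com:6443",
}


def build_submit_command(manager, app="etl_job.py"):
    """Return a spark-submit argument list for the given cluster manager."""
    if manager not in MASTER_URLS:
        raise ValueError(f"unknown cluster manager: {manager}")
    return ["spark-submit", "--master", MASTER_URLS[manager], app]


if __name__ == "__main__":
    for manager in MASTER_URLS:
        print(" ".join(build_submit_command(manager)))
```

The application argument is identical in all four commands; only the value passed to `--master` differs.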

Choosing the Deployment

The choice of Spark cluster managers varies somewhat, but most organizations have traditionally run production Spark workloads using Hadoop YARN. However, momentum is shifting to running Spark on Kubernetes. This is true for several reasons:

  • Standalone is limited – It’s the easiest to start with, but is best suited to single-node development/test clusters. It lacks the dynamic management capabilities of the other three cluster managers, and in today’s container-based, virtualized infrastructure environments, it lags behind more advanced technologies like Kubernetes.
  • Mesos is dead – Just a few months ago, it looked like Apache Mesos was headed for the Attic, the place where Apache projects go to die, but at the eleventh hour it was granted a reprieve. That said, activity in the community has slowed dramatically, with only a single release in 2020 and none so far in 2021. Mesos adopters include marquee names like Apple, Twitter, Netflix and Uber, but it never gained critical mass and didn’t make it into the mainstream like Hadoop or Kubernetes.
  • YARN is yesterday – YARN has historically been the most popular Spark cluster manager for a variety of reasons. However, unlike Mesos and Kubernetes, YARN is a relative newcomer to containers (as recently as Hadoop 3.1.1, its container support was considered experimental and incomplete). As an integral part of Hadoop, it also requires either a Hadoop cluster to run or a means of deploying YARN independent of a Hadoop cluster (e.g., as a KubeDirector application). The state of its container support and the extra Hadoop baggage translate to higher life-cycle costs and make it less attractive for new Spark deployments.

Given today’s data-driven business processes with shared, virtualized infrastructure running in complex deployments spanning on-premises data centers and public clouds, the choice for today’s production Spark workloads is clear: Kubernetes.

Kubernetes offers some distinct advantages for Spark deployments. Chief among these is its support for containers. Containers have revolutionized the way that applications are packaged and deployed, much like virtualization revolutionized server infrastructure. Containers provide better isolation, improved portability, simpler dependency management and, most importantly, dramatically reduced application cycle times. Kubernetes also provides more efficient resource management, eliminating the need for the transient clusters that Databricks, EMR and others recommend to avoid resource conflicts in non-Kubernetes Spark environments. The shorter application iteration cycles and significantly fewer setup/teardown delays provided by Kubernetes translate to substantially lower life-cycle costs. As a result, organizations moving their Spark workloads to Kubernetes can see 50% to 75% lower costs.
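For readers who want to see what a containerized submission involves, the following Python sketch assembles a `spark-submit` call for Kubernetes in cluster mode. The `--master`, `--deploy-mode`, `--class` and `--conf` flags and the three configuration keys are standard Spark options; the `k8s_submit_command` helper, the API server URL, the namespace, the image name, the main class and the jar path are all hypothetical values chosen for illustration.

```python
# Sketch: assembling a spark-submit call for Kubernetes in cluster mode.
# The API server URL, namespace, image, class and jar path are hypothetical.

def k8s_submit_command(api_server, image, app_jar, main_class,
                       namespace="spark-jobs", executors=3):
    """Return a spark-submit argument list targeting a Kubernetes cluster."""
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
        "spark.executor.instances": str(executors),
    }
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)  # a local:// path refers to a file inside the image
    return cmd


if __name__ == "__main__":
    print(" ".join(k8s_submit_command(
        api_server="https://k8s-api.example.com:6443",
        image="registry.example.com/analytics/spark:latest",
        app_jar="local:///opt/app/analytics.jar",
        main_class="com.example.Analytics",
    )))
```

Because the application and its dependencies travel inside the container image, the same image can be promoted from test to production by changing only the API server and namespace.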

Retiring the Elephant

Hadoop has had a good run over the years, but for many organizations it’s time to move on, and Spark has emerged as the tool of choice to replace it. Spark’s improved performance, affinity with existing Hadoop assets, and its more advanced approach to data have made it a popular choice for migrating Hadoop workloads. That said, Hadoop will be with us for a while. This is true for multiple reasons:

  • Migrating petabytes of Hadoop data and related applications takes time.
  • Some Hadoop services don’t have direct replacements in the Spark ecosystem yet.
  • There are still some use cases where Hadoop is a better choice.

Given this reality, organizations migrating from Hadoop need a solution strategy that provides a cost-effective home for their remaining Hadoop assets both during and after migration, while at the same time accommodating growing Spark workloads, preferably using a common, container-based platform. Ideally, the solution would support the compute and storage needs of existing Hadoop assets as well as newer Spark workloads while minimizing both the number of runtime platforms and associated storage.

Igniting a Revolution with Spark

Recent years have seen an explosion in AI and data-driven applications. This in turn has driven the migration from Hadoop and fueled the adoption of Spark and machine learning technologies.

As organizations look to migrate existing Hadoop data and applications, they need an approach that will allow them to effectively manage their shrinking Hadoop investment, while at the same time increasing their investments in Spark and machine learning technologies. The best way to do this is to embrace a Spark-plus-data-fabric strategy for analytics.

By adopting HPE Ezmeral, organizations can ease their transition into a post-Hadoop era, optimizing analytics compute functions with Spark while effectively managing legacy Hadoop assets in the process.

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: MADE.
