
Don’t Get Stuck: Migrating to a Post-Spark on YARN World

13 Aug 2021 6:42am, by Chad Smykay
Chad has an extensive background in operations from his time at USAA, and he helped build many shared services solutions at Rackspace, a world-class support organization. He has helped implement many production big data and data lake solutions. As an early adopter of Kubernetes for applications coupled with data analytics use cases, he brings a broad background in application modernization for business use cases.

Getting your vehicle stuck in the mud is an unsettling experience. You feel trapped. It happened to me this past summer, but luckily a simple solution exists: add traction.

Migrating from a Spark on YARN (SoY) implementation to a Spark Operator that runs on Kubernetes can sometimes feel the same way, like you’re stuck in the mud. So, how do you slowly introduce the Kubernetes experience to your team and run successfully in a post-SoY world? What can you do to gain traction?

I support more than 20 different Fortune 100 customers making the migration to a post-SoY world, which lets me see the common factors in their migrations. Breaking the migration down into three key questions will set you up for a successful journey to a more modern implementation of Spark:

  1. Which workloads have the simplest job and YARN container requirements?
  2. Which workloads have the fewest data connectivity needs?
  3. Which workloads have strict compute and storage latency requirements?

Workloads with the Simplest Job and YARN Requirements

Of course, the “low-hanging fruit” is to move the workloads with the least complex YARN configuration first. There are many articles, blog posts and even custom calculators on how best to calculate the YARN container configuration for your workload. My go-to is the Princeton Research Computing guide on Tuning Spark Applications. Ignoring the SLURM requirements, its explanations are the simplest to follow when you are trying to tune your Spark applications on YARN.

Figure 1. Calculating your YARN container configuration, Princeton Research Computing
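
To show the kind of arithmetic these calculators perform, here is a minimal sketch of one widely used rule of thumb. It is not the Princeton method; the function name and every default (five cores per executor, one core and 1 GB reserved per node, 10% memory overhead, one executor slot left for the application master) are assumptions you should tune for your own cluster.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors_per_node: int
    total_executors: int
    cores_per_executor: int
    heap_gb: int  # candidate value for spark.executor.memory


def plan_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing; every default here is an assumption."""
    # Leave some cores for the OS and node daemons.
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores_per_executor")
    # Memory available to each executor container on a node.
    container_gb = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    # Heap plus overhead must fit: heap * (1 + fraction) <= container.
    heap_gb = int(container_gb / (1 + overhead_fraction))
    # Leave one executor slot free for the YARN application master.
    total_executors = nodes * executors_per_node - 1
    return ExecutorPlan(executors_per_node, total_executors,
                        cores_per_executor, heap_gb)


# Example: 10 worker nodes, each with 16 cores and 64 GB of memory.
plan = plan_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(plan)
```

For this example cluster the sketch suggests 3 executors per node, 29 executors in total, and a 19 GB heap per executor. Treat those numbers as a starting point for testing, not a final configuration.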

In general, you can start from these two YARN configuration buckets:

  • Simple container definitions
  • Complex scheduler definitions

Move your simplest YARN container definitions first, as those translate most easily to Kubernetes resource assignments (number of CPUs, memory, etc.). If you have more complex YARN scheduler definitions, such as those used with the fair scheduler or capacity scheduler, move those last, after you have considered how your Kubernetes resource assignments will be defined. Note that a YARN implementation using the capacity scheduler translates more easily into shared resources within a single Kubernetes cluster deployment running multiple workloads.
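
As a minimal sketch of that translation, the helper below maps the usual `--num-executors`, `--executor-cores` and `--executor-memory` values onto Spark on Kubernetes properties. The property names are standard Spark configuration keys; the function names, the image name and the namespace are hypothetical. The overhead calculation approximates Spark's documented default for JVM jobs (10% of executor memory, at least 384 MiB), rounded up.

```python
import math


def parse_mib(size):
    """Convert a Spark-style size such as '8g' or '512m' to MiB."""
    units = {"m": 1, "g": 1024}
    value, unit = size[:-1], size[-1].lower()
    if unit not in units:
        raise ValueError(f"unsupported size suffix in {size!r}")
    return int(value) * units[unit]


def yarn_to_k8s_conf(num_executors, executor_cores, executor_memory,
                     image, namespace):
    """Map simple YARN executor settings to Spark on Kubernetes properties."""
    heap_mib = parse_mib(executor_memory)
    # Approximates Spark's JVM default: 10% of heap, never below 384 MiB.
    overhead_mib = max(384, math.ceil(heap_mib / 10))
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": executor_memory,
        "spark.executor.memoryOverhead": f"{overhead_mib}m",
        # CPU request for each executor pod.
        "spark.kubernetes.executor.request.cores": str(executor_cores),
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.namespace": namespace,
    }


# Hypothetical image and namespace, for illustration only.
conf = yarn_to_k8s_conf(4, 2, "8g",
                        image="registry.example.com/spark:latest",
                        namespace="spark-jobs")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Each executor pod then requests roughly the heap plus the overhead in memory (8192 MiB + 820 MiB in this example), which is the figure to compare against your node sizes and namespace quotas.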

Verifying Your Data Connectivity Needs

Moving to a post-SoY implementation gives you more freedom of choice in connecting Spark to your current or new data sources. Some common methods I see are:

  • Connecting to existing HDFS clusters.
  • Connecting to S3 API-enabled storage.
  • Connecting to Cloud Object Storage providers.
  • Connecting to other filesystems using Kubernetes CSI.

Most of my customers are taking this time to update their standard data-access patterns, meaning they are defining which type of data should be stored in which type of data system or object store, based on business use case or data type. For example, financial ticker data from stock trades is stored in Parquet format on an S3 API system, while data science machine learning workbooks are stored on a Kubernetes-compliant CSI filesystem. The most common pattern is to store all data on S3 API-enabled storage such as HPE Ezmeral Data Fabric or a cloud provider’s object storage.
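
One way to make such a policy concrete is to keep it as data and derive the Spark settings from it. The sketch below is illustrative only: the policy table, bucket, endpoint, mount path and claim name are all hypothetical, while the `fs.s3a.*` and `spark.kubernetes.executor.volumes.*` keys are the standard Hadoop S3A and Spark on Kubernetes property names. Credentials are deliberately left out; in practice they would come from a Kubernetes secret.

```python
# Hypothetical policy: which storage backend each class of data belongs on.
PLACEMENT_POLICY = {
    "financial_ticker": {"backend": "s3", "format": "parquet",
                         "location": "s3a://market-data/ticker/"},
    "ml_workbooks": {"backend": "csi", "format": "notebook",
                     "location": "/mnt/workbooks"},
}


def spark_conf_for(data_type,
                   s3_endpoint="https://s3.example.internal:9000",
                   volume_name="workbooks",
                   claim_name="workbooks-pvc"):
    """Return the Spark properties needed to reach one class of data."""
    rule = PLACEMENT_POLICY[data_type]
    if rule["backend"] == "s3":
        # Path-style access is commonly needed for non-AWS S3 endpoints.
        return {
            "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
            "spark.hadoop.fs.s3a.path.style.access": "true",
        }
    if rule["backend"] == "csi":
        # Mount an existing PersistentVolumeClaim into each executor pod.
        prefix = ("spark.kubernetes.executor.volumes."
                  f"persistentVolumeClaim.{volume_name}")
        return {
            f"{prefix}.mount.path": rule["location"],
            f"{prefix}.options.claimName": claim_name,
        }
    raise ValueError(f"no connector defined for backend {rule['backend']!r}")


s3_conf = spark_conf_for("financial_ticker")
csi_conf = spark_conf_for("ml_workbooks")
print(s3_conf)
print(csi_conf)
```

Keeping the policy in one table also gives your data governance team a single place to review where each class of data is allowed to live.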

Keep in mind that Kubernetes gives you greater flexibility to connect to new and interesting data sources, and those should be accounted for in your data governance policies.

Compute and Storage Latency Needs

One of the benefits of Hadoop-era workloads was the powerful combination of having your storage “next door” to your compute. Sure, in the early MapReduce days you had some issues with the shuffle tasks of your workload, but you could control them if needed. SoY shares that benefit: because compute and storage sit together, data transfers are reduced for most workloads. When you migrate to Spark on Kubernetes, where compute and storage are separated, you must keep this fact in mind.

A few questions to ask about your SoY workloads:

  1. Do my Spark jobs read large files or data sets?
  2. Do my Spark jobs read a large number of files or data sets?
  3. If I introduce additional read or write latency to my Spark jobs, will that affect my job time or performance?

It is important to run a sample job on your new Spark implementation, carefully noting your RDD read and write times. One way to get a “level set” of base performance on your current implementation versus your new one is to turn off all “MEMORY_ONLY” settings on your RDDs and persist them as “DISK_ONLY” instead. Why? Because once you have a baseline of your “DISK_ONLY” performance on both platforms, you can expect your memory-enabled RDDs to perform like for like, assuming you assign the same amount of resources in Kubernetes.
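
A minimal sketch of that baseline measurement follows. The helper names are hypothetical. `disk_only_baseline` uses the PySpark `persist`, `count` and `unpersist` calls with `StorageLevel.DISK_ONLY`, and it assumes the RDD has not already been assigned a storage level. The PySpark import is deferred so the timing helpers can be used on their own.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def slowdown(baseline_seconds, candidate_seconds):
    """Ratio above 1.0 means the candidate platform is slower."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return candidate_seconds / baseline_seconds


def disk_only_baseline(rdd):
    """Time a write pass and a read pass over a DISK_ONLY RDD (needs PySpark)."""
    from pyspark import StorageLevel  # deferred: only needed on a cluster

    rdd.persist(StorageLevel.DISK_ONLY)
    try:
        # First action computes the RDD and writes its blocks to disk.
        _, write_seconds = timed(rdd.count)
        # Second action reads the persisted blocks back from disk.
        _, read_seconds = timed(rdd.count)
    finally:
        rdd.unpersist()
    return write_seconds, read_seconds


# Comparing a 10.0 s read pass on YARN with a 12.5 s pass on Kubernetes.
ratio = slowdown(10.0, 12.5)
print(f"Kubernetes read pass took {ratio:.2f}x the YARN baseline")
```

Run `disk_only_baseline` against the same data set on both platforms and feed the two read times (and then the two write times) into `slowdown` to see how much latency the separation of compute and storage adds.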

It is also important to note that moving to a post-SoY world means you have to revisit your security policies and monitoring implementation to properly secure and monitor Spark on Kubernetes resources. Luckily, HPE Ezmeral offers a single container platform for analytics that can support centralized security and monitoring as you move to your new workload.

Recap

With these simple steps, you can create the traction you need to move to a post-SoY implementation using Kubernetes:

  • Migrate your simplest YARN configurations first, and take the time to work through complex YARN scheduler definitions, transitioning them to Kubernetes resource definitions as needed.
  • Verify any new data connectivity needs in your K8s cluster, as well as the security implications around them.
  • Run test workloads after separating compute and storage to ensure you have not introduced new latency into your jobs.

If you or your organization are struggling to start your journey on a post-SoY implementation, HPE is here to help. Check out the HPE AMP Assessment Program, a proven best practices migration methodology, to learn how HPE can help you avoid getting stuck in the mud and start you on your migration journey.

Featured image via Pixabay
