
Big Data: Google Replaces YARN with Kubernetes to Schedule Apache Spark

23 Sep 2019 11:44am, by

Kubernetes offers some powerful benefits as a resource manager for Big Data applications, but comes with its own complexities.

Speaking at ApacheCon North America recently, Christopher Crosbie, product manager for open data and analytics at Google, noted that while Google Cloud Platform (GCP) offers managed versions of open source Big Data stacks, including Apache Beam and TensorFlow for machine learning, Google is also working with the open source community to make open source Big Data software more cloud-friendly.

“What folks tend to do, when they move from on-prem to the cloud with these Big Data stacks, is they start to piece up all the different workloads, to run those on an appropriate size cluster — or appropriate size and shape really,” he explained.

“So you might have a lot of BI or reporting applications that will try to stick onto a memory-heavy cluster, or you’ll have a bunch of machine learning jobs, you’ll stick onto these compute-heavy clusters. But piecing all that up and figuring those out, which jobs align with each other, that can be a pretty difficult task.”

That’s why Google, with the open source community, has been experimenting with Kubernetes as an alternative to YARN for scheduling Apache Spark.

Crosbie works on Google’s Cloud Dataproc team, which offers managed Hadoop and Spark. These distributed systems require a cluster-management system to handle tasks such as checking node health and scheduling jobs. Apache Spark can run under a scheduler such as YARN or Mesos, in standalone mode or, now, on Kubernetes, though Kubernetes support is still experimental, Crosbie said.
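As a rough illustration of how the choice of cluster manager shows up in practice, the sketch below builds a spark-submit command line for each option. The `--master` URL schemes and the `spark.kubernetes.container.image` setting come from Spark's own documentation; the host names, ports, image tag and the `spark_submit_command` helper are placeholders invented for this example, not anything Crosbie showed.

```python
# Illustrative master URLs for each cluster manager.
# Host names and ports are placeholders, not real endpoints.
MASTER_URLS = {
    "yarn": "yarn",
    "mesos": "mesos://mesos-master.example.com:5050",
    "standalone": "spark://spark-master.example.com:7077",
    "kubernetes": "k8s://https://k8s-api.example.com:6443",
}


def spark_submit_command(manager, app, image=None):
    """Build a spark-submit argument list for the chosen cluster manager."""
    cmd = ["spark-submit", "--master", MASTER_URLS[manager]]
    if manager == "kubernetes":
        # On Kubernetes the driver and executors run as pods,
        # so Spark needs to know which container image to launch.
        if image is None:
            raise ValueError("Kubernetes mode needs a container image")
        cmd += ["--deploy-mode", "cluster"]
        cmd += ["--conf", f"spark.kubernetes.container.image={image}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    print(" ".join(spark_submit_command("yarn", "app.jar")))
    print(" ".join(spark_submit_command(
        "kubernetes", "local:///opt/app.jar", image="example/spark:2.4.4")))
```

The application itself does not change between the two commands; only the master URL and the Kubernetes-specific image setting differ.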

Google is using custom resource definitions and operators as a means to extend the Kubernetes API. So far, it has open-sourced operators for Spark and Apache Flink, and is working on more.
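To give a sense of what a custom resource looks like, here is a minimal sketch that assembles a `SparkApplication` manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The field names follow the open source Spark operator's resource as we understand it; the API version, image, class name, service account and sizes are assumptions for illustration and should be checked against the operator version actually installed.

```python
import json


def spark_application_manifest(name, image, main_class, jar,
                               executors=2, namespace="default"):
    """Return a SparkApplication custom resource as a plain dict.

    All values are illustrative; verify field names against the
    installed operator's CRD before using anything like this.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",  # assumed version
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": jar,
            "sparkVersion": "2.4.4",
            "driver": {"cores": 1, "memory": "512m",
                       "serviceAccount": "spark"},  # hypothetical account
            "executor": {"cores": 1, "instances": executors,
                         "memory": "512m"},
        },
    }


if __name__ == "__main__":
    manifest = spark_application_manifest(
        name="spark-pi",
        image="registry.example.com/spark:2.4.4",
        main_class="org.apache.spark.examples.SparkPi",
        jar="local:///opt/spark/examples/jars/spark-examples.jar",
    )
    print(json.dumps(manifest, indent=2))
```

The operator watches for resources of this kind and turns each one into a Spark job, so a job is described declaratively in the same way as any other Kubernetes object.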

Users who don’t want to run these applications in Google Cloud can download a Helm chart and run them on their own Kubernetes clusters on other clouds or on-prem.

He pointed to three primary benefits to using Kubernetes as a resource manager:

  • Unified management — Getting away from two cluster management interfaces if your organization already is using Kubernetes elsewhere.
  • Ability to isolate jobs — You can move models and ETL pipelines from dev to production without the headaches of dependency management.
  • Resilient infrastructure — You don’t worry about sizing and building the cluster, manipulating Docker files or Kubernetes networking configurations.

But there are tradeoffs, he said, outlining what he called “the Yin and Yang of going from YARN to Kubernetes”:

“It provides a unified interface if you are already moving to this Kubernetes world, but if not, this might just be like yet another cluster type to manage if you’re not already investing in that ecosystem.

“Kubernetes will enable your data scientists and developers to tap into a lot of resources. If your servers are busy during the day, you can run Big Data jobs at night when they’re less busy. But if you’ve been trying to do that already with YARN, everything you’ve done with YARN will be thrown out because Kubernetes has a different way to manage resources.

“Developers are going to love Kubernetes because they can start to put in all these custom configurations. But you’re definitely going to want to track what they’re doing. Most companies know how to do that with YARN, what to look for, what to alert on.”

“With Kubernetes, you definitely have logging, but you’re going to have to rethink what those logs actually look like,” he said.

Everybody might be on an older version of Spark that’s production tested, but if one data scientist really wants a new feature in the latest version of Spark, they can package that version as a container. It runs on the same Kubernetes infrastructure, and the jobs don’t have to conflict.
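A minimal sketch of that idea, assuming Spark's documented `spark.kubernetes.container.image` and `spark.kubernetes.namespace` settings: two jobs target the same namespace on the same cluster and differ only in their name and the image they run. The `SparkJob` class, the registry, the image tags and the namespace are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SparkJob:
    """A job description; name and image are the only per-job choices here."""
    name: str
    image: str

    def conf(self):
        # Settings passed to spark-submit as --conf key=value pairs.
        return {
            "spark.app.name": self.name,
            "spark.kubernetes.container.image": self.image,
            "spark.kubernetes.namespace": "analytics",  # shared, hypothetical
        }


# The production-tested version most of the team runs.
stable = SparkJob("nightly-etl", "registry.example.com/spark:2.3.4")
# One data scientist's job, packaged with a newer Spark.
latest = SparkJob("feature-trial", "registry.example.com/spark:2.4.4")

differing = {key for key in stable.conf()
             if stable.conf()[key] != latest.conf()[key]}

if __name__ == "__main__":
    print(sorted(differing))
```

Because the Spark version travels inside the image, neither job needs anything installed on the cluster nodes themselves.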

But for a lot of use cases, developers might find themselves dealing with something that they didn’t expect. One that often comes up is a Kubernetes network configuration needed to reach some data source that wasn’t part of the standard setup. That’s the kind of thing Google has been trying to address with Operators.

With Kubernetes, you can go from thinking about things at a cluster level to thinking about just a particular job with assigned memory, CPU and other resources. You can really isolate those containers. But there are times you want to share data between jobs, and that can be a little more difficult in this more isolated world.
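As a worked example of per-job sizing, the sketch below estimates what a single Spark job asks of the cluster. It assumes Spark's documented defaults for JVM jobs, where each executor pod requests its configured memory plus an overhead of 10 percent with a 384 MiB floor; the `executor_pod_request` helper and the sample sizes are made up for illustration.

```python
def executor_pod_request(memory_mib, cores, instances,
                         overhead_factor=0.10, min_overhead_mib=384):
    """Estimate the resources one job's executors request.

    Assumes default overhead settings for JVM jobs; non-JVM jobs and
    custom configurations use different values.
    """
    overhead = max(int(memory_mib * overhead_factor), min_overhead_mib)
    per_pod = memory_mib + overhead
    return {
        "memory_mib_per_pod": per_pod,
        "total_memory_mib": per_pod * instances,
        "total_cores": cores * instances,
    }


if __name__ == "__main__":
    # Three executors with 4 GiB and 2 cores each.
    print(executor_pod_request(memory_mib=4096, cores=2, instances=3))
```

With 4,096 MiB configured, the overhead is 409 MiB, so each pod requests 4,505 MiB and the three executors together request 13,515 MiB and 6 cores. A smaller 2,048 MiB executor hits the 384 MiB floor instead.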

Kubernetes has a lot of really cool features, especially around security, things like the secret manager. But security also can get more complicated, he said.

“It reminds me of like one of those Russian Dolls, where you have an account within an account within an account: you have a VM running a service account, then within that there’s actually a Kubernetes service account and inside of that you have Kerberos principals,” he said, adding that tracking through all that can sometimes be a problem.

Feature image by Gerd Altmann from Pixabay.
