KubeDirector, BlueData’s Custom Controller for Big Data Tasks

Oct 8th, 2018 10:16am by

BlueData, which is focused on helping enterprises accelerate their big data and machine learning deployments, is releasing the source code for the first project under its BlueK8s initiative. You can find it on GitHub.

The Santa Clara, Calif.-based company announced BlueK8s in July as an umbrella of open source projects to help bring enterprise-level capabilities for distributed stateful applications to Kubernetes. The first of those projects is KubeDirector, a custom controller for big data and AI workloads on Kubernetes. The code is in a pre-alpha state, and BlueData has been working on it for about six months.

KubeDirector is a custom controller that allows a user to deploy and manage any type of big data or machine learning cluster on Kubernetes without having to modify the application.

Big data and machine learning applications like Hadoop, Spark and TensorFlow differ from the kind of microservice-based applications Kubernetes was designed for, according to Tom Phelan, co-founder and chief architect.

He has described these stateful applications as “a jumble of tightly integrated processes with interdependencies that are not well understood and whose state is distributed across multiple configuration files.” They can’t be easily decomposed into microservices without a lot of work.

Deployed into Kubernetes, KubeDirector watches for custom resources of a given type to be created or modified within some K8s namespace(s). It then uses Kubernetes APIs to create or update the resources and configuration of a cluster to bring it into accordance with the spec defined in that custom resource, the GitHub page explains.
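The watch-and-reconcile pattern described here can be illustrated with a minimal Python sketch. It is not KubeDirector's code and makes no Kubernetes API calls: the `ClusterSpec` and `ObservedState` shapes, the role names and the action strings are all assumptions made for illustration. It only shows the idea of comparing a desired spec against what exists and deriving the changes needed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterSpec:
    """Desired state, as a custom resource might declare it (hypothetical shape)."""
    name: str
    replicas: Dict[str, int]  # role name -> desired member count


@dataclass
class ObservedState:
    """What currently exists in the namespace (hypothetical shape)."""
    members: Dict[str, int] = field(default_factory=dict)  # role name -> running count


def reconcile(spec: ClusterSpec, observed: ObservedState) -> List[str]:
    """Return the actions needed to bring the observed state in line with the spec."""
    actions = []
    for role, want in sorted(spec.replicas.items()):
        have = observed.members.get(role, 0)
        if want > have:
            actions.append(f"scale up {spec.name}/{role}: {have} -> {want}")
        elif want < have:
            actions.append(f"scale down {spec.name}/{role}: {have} -> {want}")
    # Roles that are running but no longer appear in the spec are removed.
    for role in sorted(set(observed.members) - set(spec.replicas)):
        actions.append(f"delete {spec.name}/{role}")
    return actions


def converge(spec: ClusterSpec, observed: ObservedState) -> List[str]:
    """Compute the actions, then pretend to apply them."""
    actions = reconcile(spec, observed)
    # Stand-in for real Kubernetes API calls: assume every action succeeded.
    observed.members = dict(spec.replicas)
    return actions


if __name__ == "__main__":
    spec = ClusterSpec("spark-demo", {"master": 1, "worker": 3})
    observed = ObservedState({"worker": 1, "zeppelin": 1})
    for action in converge(spec, observed):
        print(action)
```

Running `reconcile` a second time after `converge` returns an empty list, which is the property a controller of this kind aims for: once the cluster matches the spec, there is nothing left to do until the custom resource changes again.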

Unlike some other custom controller implementations, KubeDirector does not tie a custom resource definition to a particular type of application or contain hardcoded application-specific logic within the controller. Instead, application characteristics are defined by metadata and an associated package of configuration artifacts.
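To make the contrast with hardcoded controllers concrete, here is a small Python sketch in which application types are plain data and the planning logic is generic. The catalog entries, image names, package paths and field names are all hypothetical and do not reflect KubeDirector's actual metadata format. The point is only that supporting a new application means adding an entry, not changing controller code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Role:
    name: str
    image: str           # container image for this role (hypothetical value)
    config_package: str  # location of setup artifacts (hypothetical value)


@dataclass(frozen=True)
class AppDefinition:
    app_id: str
    roles: List[Role]


# Application types are registered as data. Nothing below this catalog
# contains logic specific to Spark, Hadoop or any other application.
CATALOG: Dict[str, AppDefinition] = {
    "spark": AppDefinition(
        app_id="spark",
        roles=[
            Role("master", "example.invalid/spark:demo", "/packages/spark-setup.tgz"),
            Role("worker", "example.invalid/spark:demo", "/packages/spark-setup.tgz"),
        ],
    ),
    "tensorflow": AppDefinition(
        app_id="tensorflow",
        roles=[Role("trainer", "example.invalid/tf:demo", "/packages/tf-setup.tgz")],
    ),
}


def plan_members(app_id: str, counts: Dict[str, int]) -> List[dict]:
    """Expand a requested cluster into one entry per member, using only metadata."""
    app = CATALOG[app_id]  # raises KeyError for an unregistered application type
    members = []
    for role in app.roles:
        for index in range(counts.get(role.name, 0)):
            members.append(
                {
                    "name": f"{app_id}-{role.name}-{index}",
                    "image": role.image,
                    "config_package": role.config_package,
                }
            )
    return members


if __name__ == "__main__":
    for member in plan_members("spark", {"master": 1, "worker": 2}):
        print(member["name"], member["image"])
```

The same `plan_members` function serves both catalog entries unchanged, which is the design choice the paragraph above describes: application knowledge lives in metadata and configuration artifacts rather than in the controller.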

“In Spark and Hadoop, there are configuration files that are stored in /etc or /user file directories on each node and those files contain specific configuration information for that given node,” Phelan explained. “Persistent volumes today don’t allow you to mount /etc or /user in a consistent fashion supportable across the system. So this node-specific state is not easily transferable from one container to another, which makes it challenging to run big data applications in Kubernetes today.”

Operators are another step toward making it possible to run Spark and similar applications on Kubernetes.

“People have been using Helm charts, Kubernetes Operator, stateful sets, persistent volume plans – it’s been evolving, but still not sufficient for big data and AI applications,” he said.

“Today, if I’m a Spark expert or Hadoop expert, if I want to write an operator, I also have to become an expert in Kubernetes in order to write the Go code for this Kubernetes operator.

“With KubeDirector project, we don’t require the data scientist to write any Go code. We don’t require them to become experts in Kubernetes in addition to already being experts in Spark or Hadoop or TensorFlow. They just provide a simple configuration file in YAML, and KubeDirector converts that into the equivalent of a Kubernetes Operator for that application,” he said.

After KubeDirector, the next project will focus on improving how persistent storage is implemented in Kubernetes, he said.

“Persistent volumes are great for storing data that is outside the root file system of the application. BlueData will be bringing out a mechanism for persisting data that’s inside the root file system. This is critical for running big data applications because their configuration information is stored in the root file system and in these option drives,” he said.

The company also is working on technology that improves the workflow of AI and machine learning applications. Data scientists tend to use homegrown methods to exchange AI models or schema between themselves, he said. This new technology will enable them to more easily share their AI schema and models.

Feature image via Pixabay.
