For Pepperdata, Spark-on-Kubernetes Is the Ticket off of Big Data Island
Formerly at Yahoo, one of the first users of Hadoop, he saw this whole Big Data stack — HDFS, MapReduce, YARN — developed to deal with problems of scale.
“Deliberately and by design, it was separate from everything else that IT could do,” he explained.
That led to this disconnect: All the things you could do with mainstream IT — all the tools, log management, storage systems — were totally inaccessible to the Big Data universe.
“So you got what I call the mainland and the island. The mainland of mainstream IT and the island of Big Data. It doesn’t share technologies, it doesn’t share hardware, it doesn’t share people or expertise, it doesn’t share tools. And it’s been that way for more than a decade. You don’t get to take advantage of advancements of one or the other,” he said.
The company maintains that Spark is orders of magnitude faster than MapReduce, easier to code, and more flexible.
As Pepperdata, which helps customers solve Big Data issues with Hadoop and Spark, began investigating running Spark and HDFS natively on Kubernetes, it found a community of companies (Google, Red Hat, Palantir and Bloomberg, to name a few) working on the same issues. The Spark on Kubernetes Special Interest Group was formed, and its work has been developed in a fork of Spark. Kubernetes support is expected to become part of core Spark in the next release, due out in a few months.
That will give users a fourth way to run Spark, beyond standalone, YARN and Mesos.
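In Spark releases that include Kubernetes support, the cluster manager is selected through the master URL passed to spark-submit. The following is a minimal sketch of how the four options differ at that level. The helper names, the host name and the application path are hypothetical and used only for illustration; the URL prefixes are the ones Spark documents.

```python
from typing import List

# Master URL prefixes that Spark documents for each cluster manager.
MASTER_PREFIXES = {
    "spark://": "standalone",
    "mesos://": "mesos",
    "k8s://": "kubernetes",
}


def cluster_manager(master: str) -> str:
    """Return the cluster manager that a Spark master URL selects."""
    if master == "yarn":
        return "yarn"
    for prefix, name in MASTER_PREFIXES.items():
        if master.startswith(prefix):
            return name
    raise ValueError(f"unrecognized master URL: {master!r}")


def submit_command(master: str, app: str) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper)."""
    return ["spark-submit", "--master", master, "--deploy-mode", "cluster", app]


# Placeholder API server address and application path, not real endpoints.
k8s_master = "k8s://https://kube-apiserver.example.invalid:6443"
command = submit_command(k8s_master, "local:///opt/app/example.jar")
```

Only the master URL changes between the four modes in this sketch; a real Kubernetes submission would also need settings such as the container image to use.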
While Kubernetes 1.8 added native support for Spark, it’s taking more work to make Spark fully speak Kubernetes, he said.
“Kubernetes already gives you all this flexibility where you can describe your pods, you have daemon sets, replica controllers, and primitives. We got a basic version working using primitives that already existed,” he said.
At Spark Summit 2017, Google software engineer Anirudh Ramanathan laid out the benefits of running the two together: operators have less infrastructure to manage, developers get a single interface for all their workloads, infrastructure utilization improves, and the large Kubernetes ecosystem makes a host of services immediately available to Spark users, such as the recently launched Istio service mesh project.
Two big areas of work have been on security and scheduling.
Kubernetes has security primitives, but they don’t really extend to all the arbitrary users within an enterprise, he said, so extensions were added to make it work the way Big Data systems do with Kerberos, the network authentication protocol.
The project supports Kerberos-based authentication to secure access to the overall environment and to protect credentials used to access applications.
There’s also been work to deal with scheduling the ever-changing workload of ephemeral microservices.
The project includes a data locality function to make it faster to access data across distributed instances of HDFS on Kubernetes. It would allow users to manage all the silos where data resides, regardless of whether they are deployed on-premises or in a cloud.
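The idea behind data locality can be sketched in a few lines: given the nodes that hold replicas of an HDFS block, prefer an executor running on one of those nodes, and fall back to any executor otherwise. This is an illustration of the general technique only, not the project's implementation; the function name, the executor names and the node names are all hypothetical.

```python
from typing import Dict, Set


def pick_executor(block_nodes: Set[str], executor_nodes: Dict[str, str]) -> str:
    """Choose an executor for a block, preferring one on a node with a replica.

    block_nodes: nodes that store a replica of the block.
    executor_nodes: maps executor name -> node the executor pod runs on.
    """
    if not executor_nodes:
        raise ValueError("no executors available")
    # Executors co-located with a replica can read the block without a network hop.
    local = sorted(e for e, node in executor_nodes.items() if node in block_nodes)
    if local:
        return local[0]
    # No co-located executor: fall back to a deterministic remote choice.
    return sorted(executor_nodes)[0]


executors = {"exec-1": "node-a", "exec-2": "node-b"}
local_choice = pick_executor({"node-b", "node-c"}, executors)
remote_choice = pick_executor({"node-z"}, executors)
```

The executor-to-node mapping matters on Kubernetes because an executor pod has its own identity, separate from the node that hosts it, so locality has to be resolved at the node level.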
“There will be Helm charts so users can set up storage on their Kubernetes system and use that as a secure Big Data store. Secured and high performance. That’s been one of the traditional problems with using cluster fabric on bare metal: that Big Data systems are really resource hungry. The powerful thing about Kubernetes is the abstraction to set these things up in a way that can have the same kind of performance you get out of bare metal. With most [Big Data] systems, that wasn’t really true,” he said.
Getting off the island will mean companies can simply run another project on their Kubernetes cluster without a clunky, multiple-system architecture. He predicts that within two years a company could run analytics (a machine learning project that feeds back into a user-facing application, for instance) all in one system.
Focus on Spark
Pepperdata has intensified its focus on Spark. Features like easy integration, built-in machine learning and support for streaming data are driving the boost in Spark adoption, according to experts in its “Production Spark” webinar series.
The company recently announced Code Analyzer for Spark, which gives developers the ability to connect performance issues to the blocks of code causing the problem.
In March, it released Application Profiler, a software-as-a-service version of LinkedIn’s Dr. Elephant, the open source tool that helps users of Hadoop and Spark analyze and improve the performance of their flows.
One new option for unifying Big Data with other IT infrastructure comes with SAP Vora, an in-memory computing engine for HANA that runs on Red Hat’s Kubernetes-based OpenShift Container Platform. It gathers data from Spark, Hadoop or directly from cloud environments.
At the same time, it seems Hadoop is falling out of favor.
Earlier this year, Gartner declared Hadoop obsolete, citing the complexity and questionable usefulness of the entire Hadoop stack. It noted many organizations are instead looking at cloud-based options with on-demand pricing and fit-for-purpose data processing.
Cloudera dropped “Hadoop” from the name of the conference formerly known as Strata + Hadoop World.
Hortonworks similarly renamed its Hadoop Summit line of conferences last year to DataWorks Summit to reflect the greater role of data-streaming architectures.