It’s Mesos’ SMACK Stack versus Kubernetes’ Smart Clusters for Hosting Spark
It seemed like a very convincing argument, made by more than one proponent during the MesosCon 2017 conference in Los Angeles: Apache Mesos, they said, was better suited than any other orchestrator to running the Apache Spark streaming data engine at large scale. Some said Mesos was better suited to delivering a framework capable of running Spark’s parallel tasks. But nearly without deviation, they said that Mesos’ capability to manage workloads across multiple infrastructure profiles gave it the edge.
In place of “any other orchestrator” in the above paragraph, you can easily substitute “Kubernetes.”
“We had to re-architect our software, and we went through a containerization process,” said Adam Mollenkopf, Esri’s lead developer for real-time and big data, during the MesosCon session he delivered last September. “So instead of having a monolithic application, we broke it into small microservices. What was our ingestion path has been broken into things that sit there and collect data.”
Mollenkopf presented one of the key examples of the SMACK Stack at work: a group of open source components led by Spark, and supported by Mesos (more specifically, Mesosphere DC/OS), the Akka messaging framework for Scala and Java, Cassandra as the NoSQL database component (although some have already switched to MariaDB), and Kafka for messaging. To this stack, the geospatial data provider Esri added Elasticsearch as a visualization and search engine, and the Play web framework for Java and Scala applications.
“What this allows us to do is scale out any aspect of the system,” said Mollenkopf, “based on a customer need.” His argument was that it was not always practical for an enterprise to scale out an entire application environment simply to accommodate a near-term need for just one resource. Esri’s implementation is now called its Trinity managed service stack.
“It’s much easier within DC/OS — it’s a very deterministic way to deliver these brokers,” he told one questioner. “It does that through the scheduling system of DC/OS and Mesos, where it actually ran on nodes that have availability. So we don’t really have to think about, ‘This machine is for Kafka,’ or, ‘This machine is for ElasticSearch.’ We can have a general framework and can get much better utilization out of these systems. If we want to give hints to the scheduler — to say, ‘I actually want this to be a data node, and this to be an ingestion node,’ you can do that as well through DC/OS and Mesos, through what’s called a placement config. Or if I don’t want two Kafka brokers relying on the same machine, I can give it unique hosting constraints, so it only runs on one machine. I have a lot of flexibility in how we do that.”
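For readers who want to see what such a scheduler hint can look like: Marathon, the service scheduler that ships with DC/OS, accepts placement constraints as part of an app definition. The Python sketch below assembles one such definition as JSON. The app id, command, resource figures, and the node_role attribute are hypothetical, and the Kafka service Mollenkopf describes exposes its own placement setting, which may use a different syntax.

```python
import json

# Hypothetical Marathon app definition; id, cmd, and sizes are illustrative only.
app_definition = {
    "id": "/ingest/collector",
    "cmd": "./run-collector.sh",
    "cpus": 1.0,
    "mem": 2048,
    "instances": 3,
    "constraints": [
        # Run at most one instance per agent hostname.
        ["hostname", "UNIQUE"],
        # Only use agents whose (assumed) node_role attribute is "ingestion".
        ["node_role", "CLUSTER", "ingestion"],
    ],
}

# Serialize to the JSON body that would be submitted to the scheduler.
payload = json.dumps(app_definition, indent=2)
print(payload)
```

The first constraint corresponds to the "unique hosting" case in the quote, and the second to marking nodes for a particular role.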
In his interview with us at MesosCon, Astronomer vice president of engineering Aaron Brongersma told us the alternative to running the “S,” “A,” “C,” and “K” components on the “M” would have been to configure each of those components separately, including their strategies for scalability. That would then mean solving the problem of how to manage and scale the DNS service separately as well.
“DNS tends to be one of those services that just work,” said Brongersma, “but when they don’t just work, it is painful for everyone involved. Getting it right is very difficult. Service discovery becomes next, then as you start moving up the stack, how do we schedule our containers? All of that would have been a ton of R&D work, just getting some of that handful of components, let alone what we’ve got right out of the get-go, which is, ‘Click a button, get a Kafka. Click a button, get a Cassandra.’ We still have the ability to tune those, but most of those services work right out of the box for us, without having to go find what the best practices are.”
“I’m not sure why somebody would say the parallelism needs of Spark can’t be handled on Kubernetes,” said Sean Suchter, Chief Technology Officer and co-founder of Pepperdata, in an interview with The New Stack. Suchter was vice president of search technology at Yahoo before being famously grabbed by Microsoft in 2008 to lead its Bing search technology center.
For the better part of this year, Suchter has been busy demonstrating a pipeline system that he claims not only integrates Spark with Kubernetes, but effectively makes them partners in the scheduling of analysis-intensive containers in Kubernetes pods. Kubernetes is an open source project managed by The Cloud Native Computing Foundation.
“It could be the case that what they mean is, historically, if you wanted to run Spark on Kubernetes, you had to do so in a Spark stand-alone way — which means that Spark was a layer of Kubernetes, and didn’t know how to tell Kubernetes to launch many workers, and have them communicate with each other,” Suchter continued. “You basically either had to do that all yourself — which was a really high technological barrier to set up — or your Spark app had to be one container, and therefore obviously not parallel. But with the native Spark-on-Kubernetes work — which has only existed for a little under a year, but is now fully accessible and will be in the next official release of Spark — now Spark can interact natively with Kubernetes, and it can request whatever degree of parallelism it needs, and it can do Spark dynamic allocation which allows it to dynamically change as the needs of the app progress.”
Spark already has a means with which to instruct the engine to scale the number of executors it has in use, based on its own assessed needs of the current workload. So any claim that, for instance, only Mesos may be capable of dynamically scaling one component of the stack independently, may in Suchter’s view be ignoring the last few months of work that other elements of the open source community have devoted to Spark.
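As a rough illustration of the idea, the Python sketch below models how a target executor count might follow the task backlog. It is a simplified model, not Spark's scheduler code: the function name and the example numbers are assumptions, while the three property names are Spark's documented dynamic-allocation settings.

```python
import math

# Documented Spark properties; the values chosen here are illustrative.
SPARK_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
}


def target_executors(pending_tasks, running_tasks, tasks_per_executor, conf=SPARK_CONF):
    """Hypothetical helper: executors needed for the backlog, clamped to the bounds."""
    lower = int(conf["spark.dynamicAllocation.minExecutors"])
    upper = int(conf["spark.dynamicAllocation.maxExecutors"])
    # Enough executors to run every pending and running task at once.
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(lower, min(upper, needed))


print(target_executors(100, 20, 4))   # 30: backlog fits within the bounds
print(target_executors(0, 0, 4))      # 2: idle app falls back to the minimum
print(target_executors(1000, 0, 4))   # 50: capped at the maximum
```

The point of the sketch is only that the application, rather than the cluster operator, decides how many executors to ask for as its workload changes.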
Last August, the Spark community approved the posting to GitHub of a fork of the latest Spark 2.2.0 release demonstrating the substitution of Spark’s native cluster manager with Kubernetes. It was never supposed to be open-heart surgery to begin with; the fact that Spark has run on Mesos is already an indicator that Spark is indifferent on the subject.
Yet as was demonstrated more than once during the last Spark Summit, Kubernetes and Spark won’t exactly behave as “decoupled” components. In fact, they genuinely cooperate: The usual spark-submit script will send jobs to the Kubernetes scheduler, which responds by scheduling a Spark-specific driver. As the driver requests pods containing Spark’s executors, Kubernetes complies (or declines) as needed. Executors in turn will run Spark tasks, just as they would before. The Spark driver will handle cleanup.
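A minimal sketch of the submission step, written in Python, might assemble the command line like this. The flags and property names follow the upstream Spark documentation for running on Kubernetes and may differ in the fork described here; the API server address, container image, class name, and jar path are all hypothetical.

```python
import shlex


def build_spark_submit(api_server, image, main_class, app_jar, executors):
    """Hypothetical helper: build a spark-submit command targeting Kubernetes."""
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use Kubernetes as the cluster manager.
        "--master", f"k8s://{api_server}",
        # Cluster mode: the driver itself runs in a pod on the cluster.
        "--deploy-mode", "cluster",
        "--class", main_class,
        # Number of executor pods the driver will request.
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


cmd = build_spark_submit(
    api_server="https://k8s.example.com:6443",      # assumed address
    image="example.com/spark:latest",               # assumed image
    main_class="org.example.StreamJob",             # assumed class
    app_jar="local:///opt/spark/jars/stream-job.jar",  # assumed path
    executors=5,
)
print(shlex.join(cmd))
```

Everything after submission (driver pod, executor pods, cleanup) is then handled between the Spark driver and the Kubernetes scheduler, as described above.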
It’s a sophisticated chain of events: from the submittal process, to the driver, to the executor, and finally to the tasks themselves. But it also means that Spark containers can share the infrastructure with pods containing other tasks. The goal is that Spark can share whatever infrastructure there may be, with whatever else is already running there.
Would this goal require Kubernetes to run any special configuration? No, says Pepperdata’s Suchter.
“Part of the beauty — and this is impressive about Kubernetes — that we encountered when we implemented native Spark support on Kubernetes, was that the primitives Kubernetes gave us were rich enough to allow these big data, highly parallel, highly resource-aggressive applications to co-exist with other services that were not big data, and were not aware of these Spark containers or their semantics at all — on the same cluster, without modifying Kubernetes or those other applications.”
Has Pepperdata accounted for the growing popularity of co-processing hardware, such as FPGA accelerators and general-purpose GPUs (GPGPU), which utilize parallelization uniquely? Suchter could not respond with respect to Spark on Kubernetes. But his company’s existing release editions of its database optimizers for YARN already monitor how parallel data jobs run on various hardware configurations, including those with accelerators. It’s pretty much a necessity, the CTO told us, in order to provide the determinism and reliability of performance that Hadoop users (in the case of YARN) expect.
Get Ready to Rumble
You’ve already seen, if you read The New Stack frequently, how supporters of multiple orchestrators, including both Mesos and Kubernetes, are supporting the development of the Container Storage Interface (CSI). It’s a mechanism for providing containers with access to persistent storage, theoretically enabling stateful connections to persistent databases or even to huge data lakes.
So you may be surprised to learn that Pepperdata’s Suchter is arguing that such a connection may not be necessary, or even desirable, in a highly parallel scenario involving Hadoop or Spark. Indeed, he and his team have been working on something else: a way to integrate the HDFS file system directly with containers, by way of Kubernetes’ existing storage abstractions.
“With big data systems, you don’t really want to think of the data as being loaded on one volume that gets mounted in one place at a time,” the CTO told The New Stack, “and maybe gets moved around. You just want to think of the data that you want to read. This is the whole premise of a distributed file system: You don’t only have to think of the data as volumes that get used by somebody. It’s a more abstract way of thinking. Kubernetes, of course, has primitives that allow you to think about the data at either level of abstractions. So we added a very high-performance, distributed file system — HDFS, without the Hadoop part — to be a persistent storage back-end that can be used, and is available, inside of the Kubernetes system as well.”
It would be on-cluster storage, he explained, that uses HDFS semantics and that performs with Hadoop speed. But it could be treated as a file system that is, if retroactively, native to the orchestrator. With this native level of integration, he argued, you actually would not need a messaging queue to coordinate the redistribution of data between layers. “That is a value-losing way to use a messaging system,” he said, “if you are actually doing non-toy problems. You don’t want to have more intermediaries for systems that can directly talk to each other, because each intermediary — especially messaging systems, with all their context switching — can cost you a lot of performance. If you don’t need the intermediary, don’t have it.”
It’s a strong argument by a veteran (and valuable) search system developer, and it pokes a big hole in the case in favor of the SMACK Stack. If components cooperate at a deeper level, says Sean Suchter, you may not need much in between them. And two interwoven elements could do the work of five in a stack. Yet with Spark 2.2.0 having only been released last month, it’s an argument that awaits hard proof.
Feature image: An 1893 incident where an exploded boiler from a steam engine landed neatly on top of another (none were hurt).