The Path to Getting the Full Data Stack on Kubernetes
If you’re deploying your applications using Kubernetes, you’ve likely been including a database as part of the stack. The open source community around Kubernetes has put a ton of effort into running databases in the past few years, and it’s becoming more mainstream as a result.
Leaders in building cloud native applications have also become proficient in running stateful workloads — survey data supports this, and it’s a remarkable advantage. Kubernetes is a way we can create virtual data centers with flexibility and speed.
Given that, we should be thinking of the entire application stack in that virtual data center and how best to efficiently consume the compute, network and storage that make up the capital costs of running an application. What’s missing?
Streams on Deck
Streaming workloads are the connective tissue in data applications, and deploying them in Kubernetes has been growing in popularity. But the maturity of solutions has lagged behind the push for containerized persistence. That gap is closing quickly as projects and vendors work to solve some of the unique issues that streaming creates.
Databases live and die on the quality of storage, and StatefulSets have tackled this problem head-on; streaming systems can take advantage of that work as well. However, because streaming is a network-oriented system, it takes a lot more coordination to deploy and maintain in a distributed and dynamic environment like Kubernetes.
I’m directly involved with the Apache Pulsar project, so I can speak about the challenges that are being addressed there. The first is simply the complexity that streaming can add. Pulsar is a collection of coordinating processes, not a single executable. Apache ZooKeeper (or, soon, others) is used for metadata storage and coordination. Applications connect to the Pulsar brokers, which are stateless processes that pass reads and writes to a collection of Apache BookKeeper nodes that serve as the persistence layer. And that’s the short version.
The system has to act in concert when deployed, which is a perfect job for Kubernetes. The Pulsar project has increasingly made Kubernetes the platform of choice and, as a result, is making decisions based on being cloud native.
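As a rough illustration of what "acting in concert" means, the component relationships described above can be written as a small dependency graph and sorted into a startup order. This is only a sketch: the `DEPENDS_ON` map and the `startup_order` helper are hypothetical names for illustration, and a real deployment adds more components and readiness checks than the three shown here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each component lists what must be ready first.
# Brokers are stateless but need metadata (ZooKeeper) and storage (BookKeeper).
DEPENDS_ON = {
    "zookeeper": set(),
    "bookkeeper": {"zookeeper"},
    "broker": {"zookeeper", "bookkeeper"},
}


def startup_order(depends_on):
    """Return the components in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


order = startup_order(DEPENDS_ON)
print(" -> ".join(order))
```

An orchestrator has to enforce this kind of ordering on every deploy, upgrade and recovery, which is the coordination work that makes Kubernetes a natural fit.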
Analytics Is the Next Frontier
Large-scale analytics has been a distributed systems topic all the way back to when Google MapReduce was state of the art. From the beginnings of Apache Hadoop to the more modern Apache Spark, the key feature has been coordinating many compute nodes to process smaller chunks of data in parallel.
Because this has been done for almost 20 years, different orchestration systems have come and gone to help infrastructure engineers contain the inevitable sprawl. YARN and Mesos have been the favorites for years, but times have changed, and the call is now to consolidate around Kubernetes. This isn’t just engineers migrating to the next cool thing. There are some distinct advantages to using Kubernetes, including how well it manages containers and dependencies. Security and network management are a close second, and, at the scale at which some organizations use Spark, even a small gain in efficiency is a big win.
Running analytics at scale is still an early adopter proposition right now. Kubernetes operators make it a lot less one-off, but Kubernetes itself has some identified limitations that can become an issue under larger loads.
Kubernetes was first designed for stateless workloads and had very few core components that address specific pain points in analytics. By nature, analytic workloads can be highly dynamic and bursty, which Kubernetes can handle up to certain limits, but eventually, you’ll find the ceiling.
For example, a typical full-stack deployment could have hundreds of pods and, once in place, remain relatively static. A Spark job could ask for 10,000 pods and only need them for 10 minutes. Thankfully, these problems are being addressed.
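To put that burst in rough numbers, here is a back-of-the-envelope sketch. The scheduling rate of 100 pods per second is a made-up figure chosen only for illustration, not a measured Kubernetes limit, and `scheduling_overhead` is a hypothetical helper.

```python
def scheduling_overhead(pods, pods_per_second, job_minutes):
    """Return (seconds to place all pods, fraction of the job's lifetime that takes)."""
    placement_seconds = pods / pods_per_second
    return placement_seconds, placement_seconds / (job_minutes * 60)


# The Spark job from the example: 10,000 pods needed for 10 minutes.
# pods_per_second=100 is an assumed scheduler throughput, not a benchmark.
secs, frac = scheduling_overhead(pods=10_000, pods_per_second=100, job_minutes=10)
print(f"{secs:.0f}s to place pods, about {frac:.0%} of the job's lifetime")
```

Under that assumption, a noticeable share of a short job is spent just getting pods placed, a cost that a long-lived, mostly static deployment pays only once.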
New schedulers such as Apache YuniKorn and Volcano directly address the limitations of existing Kubernetes subsystems, with the aim of making analytics practical at any volume. Infrastructure engineers are excited about Kubernetes, and this is where we are going.
Keep Moving Forward
The bad news about Kubernetes is that not everything has been figured out. The great news about Kubernetes is that not everything has been figured out. We’re building something as an open source community for what we need. A lot of great work has been done on persistence in databases, and new standards are emerging as we move to other parts of the data picture.
Streaming is becoming a critical part of the application stack, and the maturity of running it in Kubernetes is catching up with databases. Analytics is challenging Kubernetes by exposing new limitations and pushing the boundaries of the term “large scale.” I have every confidence we’ll find our way through because we’ve done this many times before. The next 10 years of infrastructure are exciting and only going to get better.
If you are wondering if you should deploy your application on Kubernetes, don’t hesitate. I’ve given you some edges to evaluate, but most deployments will only get easier to run.
If you find a limit, join the community and help make things better for the future. The Data on Kubernetes community is a gathering place for people and organizations working together to improve the next generation of infrastructure. If you want to tell your story, Data on Kubernetes Day is right before KubeCon EU, and we would love to hear it.
You can find me over at the K8ssandra project working on cloud native database topics; please stop by and join in the discussion. While you’re there, tell me how you’re going to deploy your entire data stack in a virtual data center using Kubernetes. I’m always inspired by the smart engineers making the future, and I love learning about the new ways you’re making it work.