Two Peas in a Pod: Orchestrate Your Cloud Analytics Stack with Open Source

There’s no arguing with the rise of containers in real-world deployments over the past few years. Containers simplify running applications in any environment, and Kubernetes further transforms the way software and applications are deployed and scaled, independent of where they run. In fact, Kubernetes is increasingly seen as a key technology for orchestrating resources not only in the data center but also in hybrid and multicloud environments. While containers and Kubernetes work exceptionally well for stateless applications like web servers, and even for completely self-contained databases like MongoDB and Couchbase, the stack looks a bit different in the world of advanced analytics and AI. Let’s take a look at how that stack has changed, why it is more complex to deploy in K8s clusters, and approaches for bringing the power of Kubernetes to these workloads.
Disaggregation of the Analytics Stack
What has changed in the analytical stack? Let’s travel back 45 years to 1974, when System R was being built at the IBM lab: the first implementation of SEQUEL (Structured English Query Language), now known as SQL. The architecture included a parser, a compiler, a basic optimizer, a system buffer pool and the RSS (Research Storage System). Over time this stack came to be known as the relational database.

Early DB2 internal architecture with buffer pool
While a lot has changed in the data world since 1974, from systems becoming more distributed to hardware becoming faster to far more data needing to be managed, the core concepts designed to manage data really haven’t; the stack has just become increasingly decoupled. Historically, databases were tightly integrated, with all core components built together. Hadoop changed that. Compute and storage were still co-located, but the entire system was highly distributed instead of running in a single box or a few boxes.

High-level Database / Data warehouse architecture

The cloud changed all that. Today, the data stack that the most innovative companies like Uber, Twitter, JD.com and others are building is a fully disaggregated stack. Each core element of the original relational database management system is now a standalone layer. Storage engine options range widely, from HDFS to cloud object stores to on-premises object stores. Table catalog choices range from the Hive Metastore on-premises to AWS Glue on AWS, and more will emerge. And the buffer pool of the original architecture is re-emerging as its own layer, increasingly called data orchestration.
Today’s disaggregated data stack has separate layers of compute, data orchestration and storage that scale horizontally.
So how does this affect deployment in Kubernetes? Scaling on demand and orchestration are a lot simpler for self-contained systems that either don’t need data (stateless) or are completely pre-packaged, like operational databases such as MongoDB and MySQL. These databases include everything from the query engine down to the storage tier. The modern analytical stack, now disaggregated, has many different pieces that need to be put together and scaled individually.
Challenges with ‘Kubernetifying’ the Analytics Stack

The data that needs to be analyzed by the query engine or model is now just about everywhere, increasingly physically siloed across different racks, different regions and different clouds. So unlike with traditional data platforms like data warehouses and Hadoop, the data lives outside the K8s cluster. The computational engine (compute for short), be it a powerful open source query engine like Presto (which can query data from any data source) or a modeling framework like TensorFlow or Spark, needs to scale out horizontally, and Kubernetes significantly simplifies that orchestration by bringing containers or pods up and down.
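On the compute side, that elasticity is straightforward. As a minimal sketch using the official Kubernetes Python client (the Deployment name and namespace below are hypothetical), scaling a fleet of Presto workers is a single API call:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; inside a cluster,
# config.load_incluster_config() would be used instead.
config.load_kube_config()

apps = client.AppsV1Api()

# Resize the (hypothetical) Presto worker Deployment to ten replicas;
# Kubernetes takes care of scheduling the new pods.
apps.patch_namespaced_deployment_scale(
    name="presto-worker",
    namespace="analytics",
    body={"spec": {"replicas": 10}},
)
```

The hard part, as the next paragraph shows, is that no equivalent one-liner exists for the data those new workers need.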
But let’s talk about data. If the data is not co-located with the compute within Kubernetes, it either has to be accessed remotely (meaning poor performance) or manually copied into the K8s cluster (meaning a lot of additional DevOps and management on a per-workload basis). And oftentimes that copying carries the burden of managing the differences between copies, which is hard. The better answer is to recreate data locality within the disaggregated stack. In addition, without a distributed data layer inside the K8s cluster, there isn’t an easy way to share data across jobs or to elastically grow the data tier within Kubernetes as compute scales, both key requirements for data-driven apps. To enable elastic computation for compute frameworks like Apache Spark, Presto, TensorFlow and others, data needs to be made elastic as well.
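To make the per-workload copy burden concrete, here is a hedged sketch of the kind of staging script teams end up maintaining (the bucket, prefix and scratch path are all hypothetical). Every job needs one, and every copy starts going stale the moment new data lands in the source:

```python
import os
import boto3

# Hypothetical source bucket and a per-job scratch volume in the pod.
BUCKET = "example-datalake"
PREFIX = "warehouse/events/"
LOCAL_DIR = "/scratch/events"

s3 = boto3.client("s3")

# List every object under the prefix and pull it down before the job
# can start. There is no sync-back: if the bucket changes mid-run,
# this copy silently diverges from the source of truth.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        local_path = os.path.join(LOCAL_DIR, os.path.relpath(obj["Key"], PREFIX))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], local_path)
```

Multiply that by every workload and every copy, and the management overhead becomes clear.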
Compute and Data — Two Peas in a Pod
To solve the data locality, data sharing and data elasticity challenges, compute and data need to sit together in the same pod, a K8s pod. This can be made possible by data orchestration technologies that move data from remote data silos into the K8s cluster, bringing back tight data locality on a per-worker/executor basis. Data orchestration technologies let computational frameworks like Presto drive what data is needed; that data then gets pulled from the underlying data silos into a cache in K8s that is kept in sync with the underlying system. In addition, as Presto scales up or down, this compute-friendly data tier can scale horizontally as well, enabling data elasticity.
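What does that look like in practice? Below is a minimal sketch of such a pod, built with the Kubernetes Python client. The container images, names and the shared volume are hypothetical placeholders, and the data orchestration container stands in for whichever system you choose; the point is that the Presto worker and the data orchestration worker are co-scheduled and share a local volume, so cached data sits right next to the compute that reads it:

```python
from kubernetes import client

# A shared cache volume: the data orchestration worker fills it,
# and the Presto worker reads from it at local-disk speed.
cache_volume = client.V1Volume(
    name="data-cache",
    empty_dir=client.V1EmptyDirVolumeSource(),
)
cache_mount = client.V1VolumeMount(name="data-cache", mount_path="/cache")

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="presto-worker-0", labels={"app": "presto"}),
    spec=client.V1PodSpec(
        containers=[
            # Compute: a Presto worker (image name is a placeholder).
            client.V1Container(
                name="presto-worker",
                image="example/presto-worker:latest",
                volume_mounts=[cache_mount],
            ),
            # Data: a data orchestration worker that pulls hot data
            # in from remote silos and keeps the cache in sync.
            client.V1Container(
                name="data-orchestration-worker",
                image="example/data-orchestration:latest",
                volume_mounts=[cache_mount],
            ),
        ],
        volumes=[cache_volume],
    ),
)

# Once a client is configured, submitting the pod is one call:
# client.CoreV1Api().create_namespaced_pod(namespace="analytics", body=pod)
```

Because compute and its cache scale as a unit, adding a Presto worker automatically adds cache capacity alongside it.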
With compute and data in the same pod, enabled by a data orchestration technology, the flexibility of the cloud and Kubernetes can be leveraged for analytical workloads that run on a disaggregated stack. Using this approach, users can transform a legacy data warehouse into a modern cloud native data stack built on open source technologies: Presto on top, a data orchestration layer beneath it, and data stored in any file or object store.
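From the application’s point of view, nothing changes: queries still go through Presto. As a final hedged sketch using the community presto-python-client (the coordinator host, catalog, schema and table are hypothetical), a query against a table whose files live in remote object storage looks exactly like any other Presto query; on repeated reads the data orchestration layer serves it from the in-cluster cache:

```python
import prestodb

# Connect to the (hypothetical) Presto coordinator service.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.analytics.svc",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
# The table's files live in object storage; the caching tier inside
# the cluster keeps hot data local to the Presto workers.
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())
```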