Data processing! How much do users need to know to get started? How much do scale and technology influence our understanding? How much does success depend on culture and deep background? Today, users from all kinds of backgrounds find that they want to use data processing for tasks that involve modeling, and that processing may involve more data than a single host can easily handle. An obvious answer is clustering and sharing technologies, like Kubernetes, but these are difficult to understand and to use. Maybe not for much longer.
In spite of rumors to the contrary, cloud technology is far from trivial or easy to adopt, and multicloud strategies even less so; yet this is the direction we are heading in for all kinds of data processing scenarios. The question is: how can we make this easier? Specifying a coordinated process across a distributed cluster is a non-trivial problem, both in definition and in execution. Plenty of cloud software has developed abstractions for the latter, but provider systems tend to expose all of their inner complexity, overwhelming users.
Kubernetes has emerged as a popular platform for component services, so it makes sense to use its generic capabilities for data pipelining and workflow management. But when it comes to data processing, a cool technology stack is simply an unwelcome intervention, not a helper. Kubernetes has real selling points: it can handle rolling updates of pipeline software easily, at least for so-called “stateless” component designs. Stateful components are harder, but still more robust than on unmanaged hardware. That could be an attractive proposition for business users serving live production systems, and could take away a few necessary evils for data scientists in universities and research labs, if only they could get their heads around it.
Aljabr — the Fusion of Independent Parts
Enter Aljabr, a new startup founded by Joseph Jacks (formerly of Kismatic), Petar Maymounkov (Kademlia), and myself (CFEngine). We’ve fixed our sights on solving this highly central part of the puzzle. By stripping workflow concepts down to the smallest possible number, we aim to bring scalable cloud data pipelining to the masses, while using some secret sauce to cater to the enterprise. To begin with, we’re releasing a DAG scripting language called Ko (written in Go, and building on the lessons of Petar’s experimental Escher language). Ko exposes the Kubernetes APIs within a simple DAG model, and together these build on some of the safe semantics developed for CFEngine. Ko acts as a basic assembler code for implementing generic pipelines. It is far from a finished product, but it is a powerful starting point that we hope will fuel open source innovation in the space.
Trust in Industrialized Data
Pipelines are diverse. There are task-oriented pipelines (like continuous delivery build systems) and there are data pipelines (like machine learning or statistical analysis transformations), but these have a common core of features, like task coordination, data handover, tracing, parallelization, and so on. There’s plenty of scope for taking away the burden of these basic housekeeping functions by wrapping them in a meta-pipeline platform. Going beyond simple pipelines, more general “circuits” or DAGs can be assembled all the way up to cross-cloud workflows. Making all this easy seems a small thing to ask.
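To make the common core concrete, here is a minimal sketch of the DAG idea in Go: stages are nodes, edges are data dependencies, and a stage runs once its upstream results are available. All names here (`Stage`, `Execute`, the demo stages) are illustrative assumptions, not Ko syntax or Aljabr’s implementation.

```go
package main

import "fmt"

// Stage is one node in a hypothetical processing DAG.
type Stage struct {
	Name string
	Deps []*Stage                      // upstream stages this one waits on
	Run  func(inputs []string) string // transform the upstream outputs
}

// Execute walks the DAG depth-first, running each stage only after its
// dependencies, and memoizing results so shared dependencies run once.
func Execute(s *Stage, done map[*Stage]string) string {
	if out, ok := done[s]; ok {
		return out
	}
	var inputs []string
	for _, d := range s.Deps {
		inputs = append(inputs, Execute(d, done))
	}
	out := s.Run(inputs)
	done[s] = out
	return out
}

// buildDemo wires a trivial three-stage linear pipeline as a DAG.
func buildDemo() *Stage {
	extract := &Stage{Name: "extract", Run: func([]string) string { return "raw" }}
	clean := &Stage{Name: "clean", Deps: []*Stage{extract},
		Run: func(in []string) string { return in[0] + "->clean" }}
	return &Stage{Name: "model", Deps: []*Stage{clean},
		Run: func(in []string) string { return in[0] + "->model" }}
}

func main() {
	fmt.Println(Execute(buildDemo(), map[*Stage]string{}))
}
```

The same structure covers both task pipelines and data pipelines; only the payload flowing along the edges differs.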
In the traditional view, pipeline stages are usually stateless, focusing on the work done rather than on the result achieved. They follow the “ballistic” or “pinball” view of processing, in which each stage triggers the delivery of processed data to the next stage, like a relay race. The handover assumes indiscriminate trust in (and the immediate availability of) all the stages, or else the task may simply fail. This brittleness is a problem for a distributed system: if something goes wrong, the whole pipeline has to be started again, wasting and duplicating previous effort.
We can easily improve on this by making pipelines and their stages stateful and redundant. When inputs arrive, they can be handed over to the next stage (responsibility for which may be shared by several agents working in parallel). What’s different about data pipelines, compared to other logistic chains, is that data can be duplicated and cached. Work that has already been done can be remembered and accumulated as a “capital investment”, to be picked up at any later time and completed by whichever stages remain pending. A pipeline is then really a sequence of operators acting on a persistent substrate that combines cloud infrastructure with accumulated data dependencies.
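The “capital investment” idea can be sketched in a few lines of Go: each stage’s output is keyed by name in a persistent store, so a rerun after a failure resumes from the last completed stage instead of starting over. Here an in-memory map stands in for real object storage, and all names are hypothetical, not part of Ko.

```go
package main

import "fmt"

// Store stands in for durable storage of completed stage outputs.
type Store map[string]string

// CachedStage is one step in a linear, resumable pipeline.
type CachedStage struct {
	Name string
	Run  func(in string) string
}

// RunPipeline threads data through the stages, consulting the store
// first so previously completed work is reused rather than recomputed.
func RunPipeline(stages []CachedStage, input string, store Store) string {
	data := input
	for _, s := range stages {
		if cached, ok := store[s.Name]; ok {
			data = cached // work already done: pick it up and move on
			continue
		}
		data = s.Run(data)
		store[s.Name] = data // persist the result for later runs
	}
	return data
}

func main() {
	stages := []CachedStage{
		{"normalize", func(in string) string { return in + "->norm" }},
		{"train", func(in string) string { return in + "->train" }},
	}
	store := Store{}
	fmt.Println(RunPipeline(stages, "data", store)) // first run computes both stages
	fmt.Println(RunPipeline(stages, "data", store)) // second run reuses the store
}
```

A second invocation with the same store does no new work at all, which is exactly the property that lets a crashed pipeline resume from where it left off.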
Another kind of processing chain that is much discussed today is the blockchain, in which changes represent payments. In executing a blockchain, multiple agents race each other to make their own competing versions of the final data, only eventually agreeing to accept a single version. A similar thing can happen unintentionally in any system of agents working in parallel. Such races could have a profound effect on the outcome of a computation, particularly if it has mission-critical importance, so it is critical to understand how outcomes are selected. There are many subtle issues in safely scaling data processing, and Aljabr’s vision is to bring its deep experience to bear on forward-looking solutions.
Provenance, DNA, and Forensics
Data scientists, as model builders, need to experiment quite a bit to get the results they need, and consumers are concerned about the quality of the output, and how it depends on the data and algorithms buried in its dependencies. Enterprise customers may rely heavily on such assurances. Unless log aggregation is the limit of your ambition, you will be concerned with the detailed provenance of an outcome: the flexibility to make changes, but also the transparency to compare results based on how they were composed. This is an avenue Aljabr wants to explore.
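One way provenance could work is sketched below: each stage’s output is fingerprinted together with the fingerprints of its inputs, so the final digest summarizes the entire dependency tree, and two results match only if every upstream input and transform matched. This is an assumed illustration of the general technique (a content-addressed hash chain), not Aljabr’s implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Record captures what one stage did: its name, the digests of its
// inputs, and the output it produced.
type Record struct {
	Stage  string
	Inputs []string // digests of upstream records
	Output string
}

// Digest fingerprints a record. Because input digests are folded in,
// the digest of a final result covers its whole dependency tree.
func Digest(r Record) string {
	h := sha256.New()
	h.Write([]byte(r.Stage))
	for _, in := range r.Inputs {
		h.Write([]byte(in))
	}
	h.Write([]byte(r.Output))
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	raw := Record{Stage: "ingest", Output: "rows:1000"}
	model := Record{Stage: "train", Inputs: []string{Digest(raw)}, Output: "model-v1"}
	fmt.Println(Digest(model)) // changes if any upstream detail changes
}
```

Comparing two such digests answers the composition question directly: if they differ, something in the data or the algorithms upstream differed.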
The genius of Docker was to make the kind of model that could start out like a “circuit breadboard”, then quickly be wrapped for production, with traceability. Enterprises need process observability for quality assurance and diagnosis in their complex environments. Docker provided a way to make components, build them into a library, and assemble them, even repair or upgrade them, and query their histories. Kubernetes added a smart platform to deliver into production, but not at the flick of a switch. Now it’s time to repeat these advances for data pipelines and DAGs, in a way that’s accessible to the average data scientist or system administrator.
For more information about the company and its vision, see the Aljabr blog.
Feature image via Pixabay.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.