Far from making things easier, cloud technology has only multiplied the complexities of data processing, especially across distributed systems — a problem the startup Aljabr is taking on.
Its founders — Joseph Jacks, co-founder of the Kubernetes toolkit Kismatic; Petar Maymounkov, co-author of distributed hash table Kademlia; and Mark Burgess, author of configuration management system CFEngine — are focused on simplifying the process of building and maintaining data pipelines.
“People are building pipelines based on ingesting data from a lot of different sources, whether it’s real-time, batch, click-stream data, sensor data, data they’re pulling from applications. They’re wanting to run transformation analytics, intelligence processing on that data,” Jacks explained.
As developers and infrastructure teams build these pipelines, they have to contend with an explosion in the types of technology used to process and transform the data flowing through them. Aljabr wants to simplify end-to-end pipeline development for data engineers and DevOps infrastructure people.
“The problem space is really huge,” he said. “If you talk to data engineers or DevOps people, when they’re building data pipelines, that whole process is extremely complex. It’s very error-prone, you have to deal with lots of different technologies, and it’s very difficult to roll out these pipelines and maintain them because they’re technology-specific. If you’re using Kafka, TensorFlow or Spark, different data processing, different data transformation, queuing systems — typically to build pipelines across all those technologies, you have to build a lot of custom infrastructure, and you have to write a lot of code to get that done.”
Generally, the cost and complexity of building all that are just too high, he said, so teams typically end up with isolated, fragmented data pipelines for each of these technologies.
“You might have machine learning pipelines built on TensorFlow, a totally different pipeline for doing streaming or event-driven stuff with Kafka. Different infrastructure, different teams managing it. There’s a lot of complexity in that. You might have a totally different set of pipelines in a different part of your infrastructure doing stream processing on something like Spark.
“We’re building a way to dramatically simplify it, where you could have a repeatable pipeline for real-time data, for batch data, different paradigms across systems, rather than having to build independent infrastructure for each one,” he said.
Aljabr’s founders are focused on two basic types of pipelines:
- Build and deployment pipelines, such as in continuous delivery.
- Data processing applications, such as machine learning and training.
They’re keeping the details about how they’re going about this pretty close to the vest — “We’re still stealthy,” Jacks said — but have released one project on GitHub that provides some clues to their work. It’s a directed acyclic graph (DAG) scripting language called Ko, written in Go, and building on the lessons of Maymounkov’s experimental Escher language.
Ko exposes the Kubernetes APIs within a simple DAG model, acting as a basic assembler code for implementing generic pipelines. Ko is designed to be generic, making it highly reusable, and at the same time entirely type-safe. The Ko computing model, called Recursive Circuits, supports full type inference, concurrency and deadlock-free synchronization.
Its GitHub page mentions an in-the-works Ko compiler with the ability to code-generate an implementation in any language, and a Ko interpreter that will work with any technology available in Go and be integrated with any target language.
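For readers unfamiliar with the DAG model, the idea can be illustrated with a minimal sketch in plain Go. This is not Ko syntax and not Aljabr's code: the `Step` type, the `TopoOrder` function and the step names are hypothetical, chosen only to show how a pipeline declared as a directed acyclic graph yields a valid execution order, and how a cycle is rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one node in a hypothetical pipeline DAG.
// Names are assumed to be unique within a pipeline.
type Step struct {
	Name string
	Deps []string // names of steps that must finish first
}

// TopoOrder returns an order in which every step runs after its
// dependencies, using Kahn's algorithm. It returns an error if a
// step depends on an unknown step or if the graph contains a cycle.
func TopoOrder(steps []Step) ([]string, error) {
	indegree := make(map[string]int, len(steps))
	dependents := make(map[string][]string, len(steps))

	// Register every step before wiring up the edges.
	for _, s := range steps {
		indegree[s.Name] = 0
	}
	for _, s := range steps {
		for _, d := range s.Deps {
			if _, ok := indegree[d]; !ok {
				return nil, fmt.Errorf("step %q depends on unknown step %q", s.Name, d)
			}
			indegree[s.Name]++
			dependents[d] = append(dependents[d], s.Name)
		}
	}

	// Seed the queue in declaration order so the result is deterministic.
	var queue []string
	for _, s := range steps {
		if indegree[s.Name] == 0 {
			queue = append(queue, s.Name)
		}
	}

	order := make([]string, 0, len(steps))
	for len(queue) > 0 {
		name := queue[0]
		queue = queue[1:]
		order = append(order, name)
		// Finishing a step unblocks the steps that depend on it.
		for _, next := range dependents[name] {
			indegree[next]--
			if indegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}

	// Steps left unscheduled can only be part of a cycle.
	if len(order) != len(steps) {
		return nil, errors.New("pipeline contains a cycle")
	}
	return order, nil
}

func main() {
	pipeline := []Step{
		{Name: "ingest"},
		{Name: "transform", Deps: []string{"ingest"}},
		{Name: "train", Deps: []string{"transform"}},
		{Name: "report", Deps: []string{"transform", "train"}},
	}
	order, err := TopoOrder(pipeline)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order)
}
```

A real pipeline engine would also run independent steps concurrently and pass typed data between them, which is the territory Ko's description stakes out. The sketch covers only the ordering guarantee that the acyclic structure provides.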
Jacks said the company will be releasing an open source project probably around November based on the product it has in the works.
Meanwhile, the founders are blogging more about the problem than how they’re addressing it. They state:
“Docker provided a way to make components, build them into a library, and assemble them, even repair or upgrade them, and query their histories. Kubernetes has added a new (more expensive) kind of breadboard that could actually be packaged into production, at the flick of a switch. Now the question is: can we repeat these advances for data pipelines and DAGs, in a way that the average data scientist or system administrator would find easily accessible?”