Meroxa Aims to Provide the Easy Button for Data Pipelines

11 May 2021 11:31am

After listening to customer challenges with data pipelines while at Heroku, DeVaris Brown and Ali Hamidi decided to take on those pain points when Heroku changed strategic directions. That became the basis for their company Meroxa.

“They left a huge void in the marketplace. As you’re probably familiar with, Heroku has basically become synonymous with developer experience and ease of use, and the de facto standard for how a platform-as-a-service should operate. Right? People were literally asking us for the Heroku for data, and we heard it enough. It seems to be a big opportunity for us,” said Brown, who is CEO of the new company while Hamidi is chief technology officer.

Founded in early 2020, the San Francisco-based company provides a platform as a service made up of an open source data plane and a control plane. It includes a change data capture service integrated with technologies such as Apache Kafka, along with a set of rules engines that automate repetitive engineering tasks and build data pipelines in minutes. It can expose a stream of data as an API endpoint or point it to a webhook.

“We’re standing on the shoulders of giants; we leverage a lot of open source,” said Hamidi.

The data plane uses components like Kafka and Kafka Connect as core parts of the streaming data pipeline along with a number of parts from the Kafka ecosystem, such as Kafka proxy, to enable some of its API-generation functionality. It builds on top of Kubernetes, to which they have contributed.

When a customer asks the platform to provision a pipeline or create a connection, the request goes to the control plane, which translates it into instructions and ships them out to the data plane. The data plane then stands up the machinery to enable that connectivity, Hamidi explained.
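The split Hamidi describes can be pictured with a small sketch. This is an illustration only, not Meroxa's code or API: the `PipelineRequest`, `Instruction`, `plan` and `DataPlane` names, and the three instruction types, are all hypothetical and chosen just to show a control plane turning a request into steps that a data plane carries out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRequest:
    """What a customer asks for (hypothetical shape)."""
    source: str
    destination: str


@dataclass(frozen=True)
class Instruction:
    """One step the data plane should carry out (hypothetical shape)."""
    action: str
    target: str


def plan(request: PipelineRequest) -> list[Instruction]:
    """Control plane: translate a request into ordered instructions."""
    stream = f"{request.source}-stream"
    return [
        Instruction("create_stream", stream),
        Instruction("start_source_connector", request.source),
        Instruction("start_destination_connector", request.destination),
    ]


class DataPlane:
    """Data plane: stands up whatever the instructions describe."""

    def __init__(self) -> None:
        self.running: list[str] = []

    def apply(self, instructions: list[Instruction]) -> None:
        for step in instructions:
            # A real data plane would provision topics and connectors here;
            # this sketch only records what it was asked to do.
            self.running.append(f"{step.action}:{step.target}")


data_plane = DataPlane()
data_plane.apply(plan(PipelineRequest(source="postgres", destination="warehouse")))
```

After this runs, `data_plane.running` lists the three steps in order. The point of the separation is that the customer only ever describes the outcome, while the provisioning details stay behind the control plane.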

Meroxa offers the ability to manage analytics and operational workflows with a single toolset, in real-time and at scale, according to Brown.

A data pipeline is a collection of components including resources, connectors, streams, endpoints and more that allow you to move your data from one place to another.

  • A resource is a database or service.
  • A connector is an integration between resources. It determines how data is transferred into or out of the data stream.
  • Every source connector produces data records in JSON format that go into a stream. The schema of the payload is recorded automatically within the data record, and changes to that schema are captured over time.
  • Endpoints, exposed using gRPC and HTTP, allow for bi-directional communication with a stream and provide flexibility to push to and pull from resources. Data records can be accessed programmatically using endpoints.
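To make the idea of JSON records that carry their own schema more concrete, here is a minimal sketch. It does not reflect Meroxa's actual record format: the `Stream` class, the `infer_schema` helper and the `{"schema": ..., "payload": ...}` envelope are assumptions made for illustration, and the schema is simply a map of field names to Python type names.

```python
from __future__ import annotations

import json


def infer_schema(payload: dict) -> dict[str, str]:
    """Map each field name to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in payload.items()}


class Stream:
    """Holds JSON records and the history of payload schemas seen so far."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self.schema_versions: list[dict[str, str]] = []

    def append(self, payload: dict) -> None:
        schema = infer_schema(payload)
        # Record a new schema version only when the shape actually changes.
        if not self.schema_versions or self.schema_versions[-1] != schema:
            self.schema_versions.append(schema)
        # Each record is serialized as JSON with its schema alongside the payload.
        self.records.append(json.dumps({"schema": schema, "payload": payload}))


stream = Stream()
stream.append({"id": 1, "email": "a@example.com"})
stream.append({"id": 2, "email": "b@example.com"})
stream.append({"id": 3, "email": "c@example.com", "plan": "pro"})  # new column
```

In this sketch the third record introduces a `plan` field, so the stream ends up holding three records but only two schema versions, which is one simple way to capture schema changes over time.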

Moving data around, however, is table stakes for the company, Brown said.

“All you need to know is where your data is coming from, where you want it to go, and essentially, what format do you want it to be when it gets there?” he said. “So now, if you know that things will connect, what can you build, right?”

It’s not a point-to-point offering. You can direct a real-time stream to multiple destinations.

“You have this real-time stream available, so you can actually take the same stream and point it toward a data lake, you can take that same stream and expose it as a gRPC API endpoint, you can take that same stream and point it toward your graph database. You would have had to have teams of experts to do [these things] in usual normal enterprise cases. But you can do that from a single toolset from Meroxa,” Brown said.
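The pattern Brown describes, one stream feeding several destinations, is a fan-out. The sketch below is a hedged, in-memory illustration of that pattern only: `FanOutStream`, `subscribe` and `publish` are hypothetical names, and plain Python lists stand in for the data lake, the API endpoint and the graph database.

```python
from __future__ import annotations

from typing import Callable

Record = dict
Sink = Callable[[Record], None]


class FanOutStream:
    """Delivers every published record to all subscribed destinations."""

    def __init__(self) -> None:
        self._sinks: list[Sink] = []

    def subscribe(self, sink: Sink) -> None:
        self._sinks.append(sink)

    def publish(self, record: Record) -> None:
        for sink in self._sinks:
            sink(record)


# Stand-ins for real destinations (assumptions for illustration).
data_lake: list[Record] = []
api_buffer: list[Record] = []
graph_edges: list[tuple] = []

stream = FanOutStream()
stream.subscribe(data_lake.append)   # archive raw records
stream.subscribe(api_buffer.append)  # records an endpoint could serve
stream.subscribe(lambda r: graph_edges.append((r["user"], r["follows"])))

stream.publish({"user": "ada", "follows": "grace"})
```

Each destination receives the same record and shapes it for its own purpose, so adding a new destination means adding a subscriber rather than building another point-to-point pipeline.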

While it overlaps with a number of other companies in the space, such as Fivetran, Stitch and Confluent, Meroxa’s focus on real-time streams and on exposing data to API endpoints is among the factors setting it apart, according to Brown. And while competitors tackle specific parts of the process, Meroxa seeks to be a single, end-to-end tool.

“There are many competitors, but they’re all products that kind of sit in parallel to infrastructure. Our approach is really, we want to be a part of your core data infrastructure,” he said. “We want customers and businesses to build applications, new applications, on top of the infrastructure we provide. Because we’re creating this component that basically takes the data from wherever it is, and puts it in a real-time streaming, real-time event format, so that you can then start dipping into that stream and pulling the bits that you want, and connecting it to wherever you want. So that’s a core difference in the way we’re approaching things,” he said.

Its customers use Meroxa for use cases such as:

  • Real-time data warehouse sync for analytics and dashboard visualizations.
  • Archival of raw records into a data lake for model training/active learning.
  • Programmatically listen and receive data from a pipeline for a custom service.

It’s also taking ease of use and developer experience to heart. Its goal is to allow any software engineer, or any engineer within an organization, to play the role of data engineer.

“This space is super crowded with companies, but there are many, many problems that aren’t being solved very well,” Hamidi said. “Areas like data hygiene, good quality, data lineage: I think these are all like super-interesting problems. But data transformation, data processing, stream processing, those are all areas that are still incredibly complex and very difficult to tackle at scale in any sort of manageable way.”

So the company’s main focus in the near future is to bring the same ease of use it offers for data connectivity to stream processing and transformation. It’s working on a product called Functions to provide a layer of abstraction that makes that kind of processing very easy.

“One of the things that we learned, really the hard way, is running Kafka Connect at scale is very hard. It’s a painful process in general; managing Kafka Connect at scale is hard. Writing Kafka connectors is really hard. Yes, there are a number of sort of open source components that exist. But the quality of the components really varies drastically. And so that’s the area that we’re focusing on,” Hamidi said.

The company recently raised a $15 million Series A funding round led by Drive Capital, bringing the company’s total funding to $19.1 million.

In a post about contributing to that funding round, Amplify partner Sarah Catanzaro noted that data practitioners usually need four to eight tools just to move data into and out of the data warehouse.

“It’s hard for most data practitioners to read about [real-time data applications] as they stare mindlessly at half a dozen consoles trying to untangle their burgeoning DAG [Directed Acyclic Graph] of DAGs. How can they compete if they must spend hours perusing the documentation for Django, Singer, Airflow and Docker just to power a batch-based BI dashboard?” she asked.

“Unlike other batch tools that sacrifice speed and performance for simplicity, Meroxa’s easy-to-use streaming data platform will grow with data teams as they build data-intensive applications that rival those from tech behemoths. Built upon the latest technologies in distributed systems and data management, Meroxa empowers its users to embrace a future wherein machine learning systems are online; dashboards update every second; blazingly fast services rely on multiple specialized data stores,” she wrote.