Even as companies are expanding roles trying to draw insight from data, a particular point of pain persists: Getting the data into one place and in a format where it can be used.
That’s according to Michel Tricot and John Lafleur, the co-founders of Airbyte, an open source data integration platform that aims to standardize and simplify the building of connectors between data sources.
A Matillion and IDG Research survey found the mean number of data sources per organization was 400. More than 20% of responding companies were drawing from 1,000 or more data sources. More than 90% said that it was challenging to some degree to make data available in a format usable for analytics.
Tricot was director of engineering and head of integration at LiveRamp, working with a ton of data in marketing technology.
“That’s really where I got my hands dirty on what does it mean to build a ton of data pipelines. We were managing thousands of different data integrations, so pulling data from our customers into their infrastructure and sending it to all the different APIs for [ad and marketing technologies]. We were moving hundreds of terabytes of data every single day,” he said.
Then he moved to rideOS, a company that was doing similar work on map and traffic data.
Lafleur, who is now on his fourth startup, spent his first building an engineering management platform on top of engineering tools.
“Instead of using six tools to get the data, you’d have everything in one. So we had to build all these ETL [extract, transform, load] pipelines. It was a nightmare. I still have scars from that,” he said.
Despite the crowded field in data integration offerings, there had to be a better way, they decided. So they set out to create one, founding the San Francisco-based company just last year.
Last summer they talked to 45 customers of Fivetran, Stitch Data and Matillion, who said they were still building connectors in-house even while using those technologies. That convinced the pair that open source was the way to go.
Airbyte looks to crowdsource the long tail of connectors, since the co-founders maintain that the problem is not just building connectors but maintaining them. For a single company, the ROI of maintaining some obscure connector often doesn’t make sense, even though perhaps five companies might need it. The company hopes to create a repository of up-to-date, freely available connectors so companies can stop duplicating work by building their own. It has set a goal of providing 200 connectors by the end of the year.
Using REST APIs, users can extract data from sources such as databases, Facebook Ads, Salesforce and Stripe, and move it to destinations like Redshift, Snowflake and BigQuery.
The technology is Docker-based, allowing a developer to write a connector in any language. It provides templates to make this easier for those who prefer to work in Python or Java. Using containers makes it easier to apply updates.
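To make the container idea concrete, here is a minimal sketch of what a connector's entry point might look like. It assumes, purely for illustration, that a connector is a program that writes one JSON message per line to an output stream. The message fields, the `run_source` function and the sample rows are all hypothetical and are not taken from Airbyte's specification.

```python
import json
import time
from typing import Iterable, Iterator, TextIO


def read_records(rows: Iterable[dict], stream: str) -> Iterator[dict]:
    """Wrap each source row in a hypothetical envelope naming its stream."""
    for row in rows:
        yield {
            "type": "RECORD",  # assumed message type, for illustration only
            "stream": stream,
            "data": row,
            "emitted_at": int(time.time() * 1000),
        }


def run_source(rows: Iterable[dict], stream: str, out: TextIO) -> int:
    """Write one JSON message per line to `out` and return how many were written."""
    count = 0
    for message in read_records(rows, stream):
        out.write(json.dumps(message) + "\n")
        count += 1
    return count


# Stand-in for rows that a real connector would fetch from a database or API.
SAMPLE_ROWS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
```

Calling `run_source(SAMPLE_ROWS, "customers", sys.stdout)` from a container's entry point would print two lines of JSON. Because the only contract in this sketch is text on an output stream, the same program could in principle be written in any language and packaged in an image.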
Its graphical interface enables less tech-savvy users to configure operations such as creating sources, destinations and connections from the API. It employs a default Postgres database as part of the Docker service; a Temporal service manages the task queue and workflows for the scheduler. It supports Kubernetes, Airflow and dbt, a way to use SQL to execute data transformation jobs.
“All these connectors, they run as Docker images, so they can run on any system like Kubernetes or Fargate or anything that has support for running containers. And that allows you to actually run it without ever having to think about what environment it’s in. Or what do I need to make that connector work? So it’s really about optimizing it, making it very simple to use,” said Tricot.
The company recently released what it calls a CDK, a connector development kit, which it says makes building a connector a two-hour job. It claims to take 75% of the code out of the process, leaving only the connector-specific code to write. The kit provides a Python framework for writing source connectors; a generic implementation for HTTP sources, including specific helpers for REST APIs, GraphQL, Singer Taps and other generic Python sources; and a test suite and code generator.
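The division of labor described here, with a framework owning the generic plumbing and the developer supplying only source-specific pieces, can be sketched in a few lines. This is not the CDK's actual API. `PagedSource`, its method names and the in-memory `InMemoryInvoices` source are hypothetical names chosen to show the pattern.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator, Mapping, Optional


class PagedSource(ABC):
    """Hypothetical framework class that owns the generic pagination loop."""

    @abstractmethod
    def fetch_page(self, cursor: Optional[str]) -> Mapping[str, Any]:
        """Return one page of raw results for the given cursor."""

    @abstractmethod
    def parse_records(self, page: Mapping[str, Any]) -> Iterator[dict]:
        """Pull the individual records out of a raw page."""

    @abstractmethod
    def next_cursor(self, page: Mapping[str, Any]) -> Optional[str]:
        """Return the cursor for the next page, or None when finished."""

    def read(self) -> Iterator[dict]:
        """Generic loop: fetch, parse, advance, until no cursor remains."""
        cursor: Optional[str] = None
        while True:
            page = self.fetch_page(cursor)
            yield from self.parse_records(page)
            cursor = self.next_cursor(page)
            if cursor is None:
                break


class InMemoryInvoices(PagedSource):
    """Connector-specific code only; canned pages stand in for HTTP calls."""

    PAGES = {
        None: {"items": [{"id": 1}, {"id": 2}], "next": "p2"},
        "p2": {"items": [{"id": 3}], "next": None},
    }

    def fetch_page(self, cursor):
        return self.PAGES[cursor]

    def parse_records(self, page):
        return iter(page["items"])

    def next_cursor(self, page):
        return page["next"]
```

Iterating over `InMemoryInvoices().read()` walks both pages in order. In a real connector, `fetch_page` would make the network request, while the loop, and in a fuller framework the retries, rate limiting and testing, would stay in shared code.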
Decoupling EL and T
The company hopes to become the open source standard for syncing data from applications, APIs and databases to warehouses, data lakes and other destinations. In May, it announced a $26 million Series A, just months after announcing its $5.2 million seed round, bringing its total to more than $31 million raised just this year.
Tricot and Lafleur see convergence taking place between warehouses and data lakes, but contend that even the move from traditional ETL to ELT, in which cloud-based storage frees users from having to know what they want to do with data before loading it, has its drawbacks. They insist that transformation must be decoupled from the extract and load functions because volume-based pricing prevents full use of a company’s data as cost concerns get in the way.
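What decoupling looks like in practice can be shown with a small, self-contained sketch. It assumes an in-process SQLite database standing in for a warehouse, and the table names, field names and both functions are hypothetical. The point is only that the load step stores raw payloads untouched, while the transform step is a separate job that can be rerun or rewritten without extracting again.

```python
import json
import sqlite3
from typing import Iterable


def extract_and_load(conn: sqlite3.Connection, rows: Iterable[dict]) -> None:
    """EL step: store each row as an untouched JSON payload."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders (payload) VALUES (?)",
        [(json.dumps(row),) for row in rows],
    )
    conn.commit()


def transform(conn: sqlite3.Connection) -> None:
    """T step: rebuild a summary table from the raw payloads already loaded."""
    conn.execute("DROP TABLE IF EXISTS orders_by_customer")
    conn.execute(
        "CREATE TABLE orders_by_customer "
        "(customer TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    totals: dict = {}
    # Payloads are parsed in Python so the sketch does not rely on
    # SQLite's optional JSON functions.
    for (payload,) in conn.execute("SELECT payload FROM raw_orders").fetchall():
        order = json.loads(payload)
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0) + order["amount_cents"]
    conn.executemany(
        "INSERT INTO orders_by_customer VALUES (?, ?)", sorted(totals.items())
    )
    conn.commit()
```

In this arrangement `extract_and_load` runs once per sync, and `transform` can be changed and rerun at will against the raw table. A real deployment would more likely express the transform in SQL, for instance through dbt, which the article mentions Airbyte supports.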
“The way data integration has been done, as closed source, it’s kind of falling apart in the sense that it doesn’t actually solve the integration,” Lafleur said. “The companies still have to build and maintain, plus it has a high cost of volume-based pricing. So that means you cannot replicate databases. As soon as you have millions of rows, you cannot really use those because it becomes too pricey. That’s where an open source approach can take over.”
GitLab seems to agree, having just spun off its open source platform Meltano as a standalone business.
So far, Airbyte is focused on the open source community edition, building a hub of connectors kept current, which the co-founders fault competing project Singer for failing to do.
Eventually, the company will pursue an open core strategy, they said, offering an enterprise edition including data quality protocols, compliance features, role and access management, and single sign-on. A hosted version has been their most-requested feature, they said, but that’s in the future.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.