
Acryl Data Offers Real-Time Metadata Management

3 Aug 2021 10:22am, by

One thing Acryl Data’s founders learned while working at LinkedIn and Airbnb is that metadata is constantly changing, and processes that rely on batch processing can’t keep up with today’s dynamic data workloads.

“As I was leading a bunch of these initiatives [at LinkedIn], I ended up solving this problem once, twice, thrice. And it took us like three times to get it right, I think,” said Shirshanka Das, CEO of Acryl Data.

“And the key observation there was that metadata infrastructure really has to be aligned with data infrastructure in how it’s built. So the old-school metadata infrastructure, and in many cases, a lot of current technologies in the market also, are very crawl-oriented, like you connect up into systems, and then you just try to make sense of it all. And it’s batch-oriented, but that’s not the real world today.”

He and Swaroop Jagadish, who came from Airbnb but had previously worked at LinkedIn, set out to build a company around DataHub, the metadata search and discovery platform that LinkedIn open sourced in February 2020.

They talked to many companies that use a wide range of tools in their stacks, where metadata about the data changes constantly. That means you have to be constantly listening in on events, Das said.

Acryl Data takes a stream-first, developer-first approach to collecting metadata.

Instead of observing after the fact what happened, the platform instruments the points where data originates and where it is transformed, creating a comprehensive real-time metadata graph.

“And then you’ll have the ability to drive, really delightful search and discovery experiences on top of it across all of your data ecosystem,” said Jagadish, the company’s chief technology officer. “Whether it’s APIs, ingestion pipelines, warehouse, machine learning models, dashboards — everything that you care about is in this comprehensive graph that is maintained fresh.”

Streaming Plus API-First

Das was the overall architect for big data at LinkedIn, whose teams created DataHub as well as other data projects. Jagadish was head of data platform and search infrastructure at Airbnb as it created Dataportal.

In addition to DataHub, the platform employs Apache Gobblin, the distributed data integration framework also created at LinkedIn. It’s event-oriented and can ingest and store any kind of large metadata model. It’s open, flexible and customizable.

Expedia, Saxo Bank and Klarna are among the companies using DataHub as the framework for building their own metadata graphs to connect and compile their various data sources.

With an Apache Kafka-based streaming architecture, the Acryl Data platform provides a common metadata substrate across disparate data tools. The platform is in private beta, as is the company’s data catalog, an enterprise-ready SaaS product that enables data professionals to search and explore their entire multicloud data ecosystem.

With the platform’s API-first design, companies can implement DataOps practices in their data architectures, providing safe, reproducible evolutions of their analytics and artificial intelligence (AI) artifacts, according to Das. Users also can build policy on top of the metadata. A policy would be a set of programmatic instructions given to the engine, defining a predicate on the metadata that identifies a certain class of data set.
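As a rough illustration of a policy expressed as a predicate over metadata, the sketch below filters a handful of data set records. It is a minimal sketch only: `DatasetMetadata`, `Policy`, `matching` and the sample records are hypothetical names invented for this example and are not part of DataHub's or Acryl Data's API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


# Hypothetical, simplified metadata record (not DataHub's actual model).
@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    owner: Optional[str]
    tags: FrozenSet[str] = frozenset()


# A policy pairs a name with a predicate that picks out a class of data sets.
@dataclass(frozen=True)
class Policy:
    name: str
    predicate: Callable[[DatasetMetadata], bool]


def matching(policy: Policy, datasets: List[DatasetMetadata]) -> List[str]:
    """Return the names of the data sets the policy's predicate selects."""
    return [d.name for d in datasets if policy.predicate(d)]


# Made-up sample metadata for illustration.
datasets = [
    DatasetMetadata("mortgage_applications", "risk-team", frozenset({"pii", "external"})),
    DatasetMetadata("page_views", None, frozenset({"clickstream"})),
    DatasetMetadata("partner_credit_scores", None, frozenset({"pii", "external"})),
]

# Example policy: personal data that currently has no owner.
unowned_pii = Policy("unowned-pii", lambda d: "pii" in d.tags and d.owner is None)

print(matching(unowned_pii, datasets))
```

Keeping the predicate as plain code is one way to make a policy testable and versionable like any other artifact, which fits the DataOps framing Das describes.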

Take, for instance, a machine learning model that predicts mortgage approvals and relies on certain attributes in data sets ingested from external partners.

“Things can break at any point here, and they can break in all kinds of ways: the schema of data set may change in unexpected ways or the shape of the data can change in unexpected ways. The owner of a data set may no longer be present,” Jagadish points out.

If the profile of a data set changes and that changes your machine learning features, a workflow could be set to interrupt the training pipeline rather than let it continue, and to trigger an alert that people can act on, he said.
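A gate of that kind might look roughly like the following sketch, which compares a current data set profile against a baseline before training is allowed to run. Everything here is assumed for illustration: `ColumnProfile`, `check_profile`, `run_training`, the thresholds and the sample numbers are hypothetical and do not come from Acryl Data's product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical per-column profile; real profiles would carry more statistics.
@dataclass(frozen=True)
class ColumnProfile:
    null_fraction: float
    mean: float


class ProfileDriftError(RuntimeError):
    """Raised to stop the pipeline when the data profile has drifted."""


def check_profile(
    baseline: Dict[str, ColumnProfile],
    current: Dict[str, ColumnProfile],
    max_null_increase: float = 0.05,  # assumed threshold
    max_mean_shift: float = 0.20,     # assumed relative threshold
) -> List[str]:
    """Return a list of human-readable drift problems (empty if none)."""
    problems = []
    for column, base in baseline.items():
        cur = current.get(column)
        if cur is None:
            problems.append(f"{column}: column missing")
            continue
        if cur.null_fraction - base.null_fraction > max_null_increase:
            problems.append(
                f"{column}: null fraction rose from "
                f"{base.null_fraction:.2f} to {cur.null_fraction:.2f}"
            )
        if base.mean != 0 and abs(cur.mean - base.mean) / abs(base.mean) > max_mean_shift:
            problems.append(f"{column}: mean shifted from {base.mean} to {cur.mean}")
    return problems


def run_training(baseline, current, train: Callable[[], object],
                 alert: Callable[[List[str]], None]):
    """Run `train` only if the profile is healthy; otherwise alert and stop."""
    problems = check_profile(baseline, current)
    if problems:
        alert(problems)
        raise ProfileDriftError("; ".join(problems))
    return train()


# Made-up profiles: the null fraction of "income" jumps in the new batch.
baseline = {"income": ColumnProfile(null_fraction=0.01, mean=50000.0)}
drifted = {"income": ColumnProfile(null_fraction=0.30, mean=52000.0)}
```

Calling `run_training(baseline, drifted, train, alert)` with these profiles would invoke `alert` and raise `ProfileDriftError` instead of training, while passing the baseline as the current profile lets training proceed.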

In compliance and governance, similar to the way Airbnb assigned bronze, silver and gold designations to its data stack, you could create a pipeline that attaches privacy classifications to the metadata and has machine learning models assign confidence scores indicating whether those annotations are actually correct.

“Are we a compliance platform? Are we a data productivity platform? We are really a metadata substrate, and you can build a bunch of these compliance-related or governance-related workflows on top of it by assigning policies to metadata,” Das said.

Scale, Maturity

Because open source DataHub has been around for several years with major enterprises using it in production, the founders feel confident about the scale and the maturity of the platform. The project has more than 3,300 stars on GitHub, and the number of people following the company on its Slack channel has quadrupled since the beginning of the year, Das said.

The company is focused on providing an easy-to-consume SaaS product and building out common workflows on top of it, Jagadish said: How can a data scientist validate that a metric is reliable? What are the right process signals to provide in context? How can context be blended into the search experience without overwhelming the user?

The Mountain View, California-based company in June announced a $9 million funding round led by 8VC, with LinkedIn and Insight also participating.

“The modern data stack needs a fundamental rethink in how metadata is managed. We believe a next-generation, real-time metadata platform is needed, and Acryl Data is the best team to lead this transformation based on their groundbreaking work with DataHub,” George Mathew, Insight Partners managing director, said of the funding. (Insight Partners also acquired The New Stack in June.)

Third-Generation Architecture

Das and Jagadish consider Collibra and Alation their closest competitors, though in a blog post about the evolution of metadata infrastructure, Das puts those two among the second-generation offerings, writing:

“Out of all the systems out there that we’ve surveyed, the only ones that have a third-generation metadata architecture are Apache Atlas, Egeria, Uber Databook, and DataHub.”

He goes on:

“The key insight leading to the third generation of metadata architectures is that a ‘central service’-based solution for metadata struggles to keep pace with the demands that the enterprise is placing on the use cases for metadata. To remedy this problem, there are two needs that must be met. The first is that the metadata itself needs to be free-flowing, event-based and subscribable in real-time. The second is that the metadata model must support constant evolution as new extensions and additions crop up — without being blocked by a central team. This will allow metadata to be always consumable and enrichable, at scale, by multiple types of consumers.”
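The two needs Das describes, metadata that is event-based and subscribable, and a model that can grow without a central team, can be sketched with a toy in-memory publish/subscribe stream. This is an illustrative sketch only: it stands in for a real streaming system such as Kafka, and `MetadataChange`, `MetadataStream` and the sample events are hypothetical names, not DataHub's actual event model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical change event: one kind of metadata about one entity changed.
@dataclass(frozen=True)
class MetadataChange:
    entity: str  # e.g. a data set name
    kind: str    # e.g. "schema", "ownership"; new kinds need no central change
    value: Any


class MetadataStream:
    """Toy in-memory stand-in for a streaming metadata log."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[MetadataChange], None]]] = {}
        self._latest: Dict[Tuple[str, str], Any] = {}

    def subscribe(self, kind: str, handler: Callable[[MetadataChange], None]) -> None:
        """Register a handler for every future change of the given kind."""
        self._subscribers.setdefault(kind, []).append(handler)

    def publish(self, change: MetadataChange) -> None:
        """Record the latest value and push the change to subscribers."""
        self._latest[(change.entity, change.kind)] = change.value
        for handler in self._subscribers.get(change.kind, ()):
            handler(change)

    def latest(self, entity: str, kind: str) -> Any:
        """Return the most recent value for (entity, kind), or None."""
        return self._latest.get((entity, kind))


stream = MetadataStream()
schema_changes: List[MetadataChange] = []
stream.subscribe("schema", schema_changes.append)

stream.publish(MetadataChange("orders", "schema", ["id", "amount"]))
stream.publish(MetadataChange("orders", "ownership", "payments-team"))
# A team can introduce a brand-new kind of metadata without changing the stream.
stream.publish(MetadataChange("orders", "quality_score", 0.97))
```

In this sketch the schema subscriber is notified only of the schema change, while any consumer can still look up the latest ownership or quality score on demand.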

Sheetal Pratik, Saxo Bank’s head of data integration, calls DataHub a great fit as a modern data catalog for Saxo’s data mesh architecture.

“Its third-generation extensible architecture is already showing benefits in adoption at scale as we onboard new domains and bring in various aspects of data management into a cohesive unit,” she wrote in a blog post.

“We are seeing productivity enhancements as people are able to discover schemas in the pre-prod environment before they are deployed to production, making them aware of changes before they happen.”

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: acryl, Real.

Image by ktphotography from Pixabay 
