Knowing what data a company has, and where and how it's stored, has gained urgency with the enactment of the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which takes effect Jan. 1.
But other concerns, such as worker productivity, also come into play for organizations dealing with massive amounts of data. Ride-share company Lyft, for example, collects data on the more than 50 million rides it provides a month.
Yet data scientists spend up to a third of their time finding the data they need and then figuring out whether the data they find can be trusted, according to the company.
In response, Lyft built Amundsen, a data-discovery application on top of a metadata repository, to make it easier for data scientists and others to find and interact with data. It's named after Norwegian explorer Roald Amundsen, whose expedition was the first to reach the South Pole, and patterned after Google search. Lyft open-sourced the project in April.
Emphasis on Trust
Lyft has been growing rapidly, both in the volume of services it provides and in the number of employees joining the company. That growth has left a knowledge gap about what data the company has, what work has previously been done on it and how up to date it is, product manager Mark Grover explained in a recent webinar.
Lyft’s data sources include both structured and unstructured data stores like Hive, Presto, Postgres and Amazon Redshift.
It faced the challenge that no single data model fit all of its data resources, and that each stored and fetched data differently.
Its requirements for the project included:
- Trust embodied in the solution — things more trustworthy show up first in the search results.
- Little manual curation — it needed to be automated.
- A preference for open source.

While the team considered open source projects like LinkedIn's WhereHows and Apache Atlas, it ultimately decided the experience it wanted wasn't out there, Grover said, and set out to build its own.
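The trust requirement — surfacing more trustworthy tables first — can be illustrated with a minimal ranking sketch. The scoring formula, field names and weights below are hypothetical, not Amundsen's actual logic:

```python
# Hypothetical sketch of trust-weighted ranking: tables that are queried more
# often and updated more recently rank higher for the same text match.

def trust_score(table, text_relevance):
    """Combine text relevance with usage-based trust signals."""
    usage_boost = min(table["query_count"] / 100.0, 1.0)      # heavily queried -> trusted
    freshness_boost = 1.0 if table["days_since_update"] <= 7 else 0.5
    return text_relevance * (1.0 + usage_boost) * freshness_boost

tables = [
    {"name": "eta_raw",     "query_count": 10,  "days_since_update": 90},
    {"name": "eta_curated", "query_count": 500, "days_since_update": 1},
]

# Same text relevance for both; trust signals break the tie.
ranked = sorted(tables, key=lambda t: trust_score(t, 1.0), reverse=True)
print([t["name"] for t in ranked])  # ['eta_curated', 'eta_raw']
```

The point is that popularity and freshness signals, gathered automatically, substitute for manual curation.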
There are four parts to Amundsen:
- Crawler — Called Databuilder and similar to Google's web crawlers, it crawls the databases, dashboards and HR systems to determine which tables were newly created since the last run, which columns were added, who gained access to the system, who left the company and more. It uses Apache Airflow to orchestrate its jobs.
- Search engine — Similar to Google's and built on Elasticsearch, it supports multiple types of search: normal, which ranks records by relevance; category, which matches records first by data type, then by relevance; and wildcard.
- Front-end service — If, for example, you're looking for data on estimated arrival times (ETAs) for drivers, you type "ETA" in the search box and get a ranked results page similar to Google search. The information there includes how commonly a table is queried, when the table was last populated and who else is using it. Clicking the first result brings up more detail, including the schema of the table, a quick preview of the data and stats about the shape of the data, such as standard deviations and means.
- Graph database — A metadata repository containing information about tables, people and the relationships between them. It's built on Neo4j, with support for Apache Atlas in the works. It also exposes REST APIs so other services can push or pull metadata directly.
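The crawler's diffing step described above can be sketched as a comparison of schema snapshots between runs. The snapshot shape and names here are illustrative, not Databuilder's real model:

```python
# Hypothetical sketch of a crawl's diffing step: compare the previous run's
# snapshot of table schemas against the current one to find new tables and
# newly added columns.

previous = {
    "rides":   {"id", "driver_id", "eta"},
    "drivers": {"id", "name"},
}
current = {
    "rides":    {"id", "driver_id", "eta", "surge_multiplier"},  # column added
    "drivers":  {"id", "name"},
    "payments": {"id", "amount"},                                # table added
}

new_tables = set(current) - set(previous)
new_columns = {
    table: current[table] - previous[table]
    for table in current
    if table in previous and current[table] - previous[table]
}

print(new_tables)   # {'payments'}
print(new_columns)  # {'rides': {'surge_multiplier'}}
```

In the real system this runs as a scheduled Airflow job, so the metadata repository stays current without anyone curating it by hand.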
The first iteration focused heavily on tables, supporting the work of data scientists and analysts, who use raw data sets to do analysis. It has since added a second node type: people.
“I can go to the page of a person on the team — what tables does she own, what does she bookmark, what does she use frequently? Those conversations we used to have on Slack don’t need to happen anymore because I have that information,” Grover said.
It plans to add more node types, including dashboards, streams, ETL and data quality.
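The person-node lookups Grover describes can be sketched as an in-memory edge list connecting people to tables. In Amundsen this graph lives in Neo4j; the structure and relationship names below are purely illustrative:

```python
# Toy version of the metadata graph's person node: edges connect a person to
# tables they own, bookmark, or frequently use. Names are hypothetical.

edges = [
    ("alice", "OWNS", "rides"),
    ("alice", "BOOKMARKS", "eta_curated"),
    ("alice", "FREQUENTLY_USES", "payments"),
    ("bob",   "OWNS", "drivers"),
]

def tables_for(person, relation):
    """Answer questions like: what tables does this person own?"""
    return sorted(t for (p, r, t) in edges if p == person and r == relation)

print(tables_for("alice", "OWNS"))       # ['rides']
print(tables_for("alice", "BOOKMARKS"))  # ['eta_curated']
```

Each planned node type (dashboards, streams) would add new edge kinds to the same graph rather than a new system.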
Metadata at the Core
Metadata is key to the next wave of big data applications, according to the company.
“We realized we were gathering all these interesting metadata that we wanted to use for data discovery and trust, but we could use it for other applications as well. What we ended up building was this data discovery application on top, but at the bottom was this metadata engine, the core of all the information that people use to power the data,” Grover said.
“If I know where all our data is stored, if I can tag all these columns as personal or private and know who’s accessing this data, then I can have a governance system based on this,” he said of the compliance use case for the system.
Rather than manual approaches or isolating sensitive data in a separate database or location, metadata can be used to restrict access appropriately and maintain compliance, he explained in a blog post.
For ETL and data quality, using profiles of the data in all the tables, users can apply heuristics to compare the data going in today with yesterday's, then set an allowable percentage of difference.
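That heuristic can be sketched in a few lines: compare today's table profile (here, just a row count) with yesterday's and flag the load if the change exceeds the allowed percentage. The threshold and profile fields are illustrative, not Lyft's actual checks:

```python
# Hypothetical data-quality check: flag a table load when today's profile
# deviates from yesterday's by more than an allowable percentage.

def within_tolerance(yesterday_rows, today_rows, allowed_pct=10.0):
    """Return True if the day-over-day change stays within allowed_pct."""
    change_pct = abs(today_rows - yesterday_rows) / yesterday_rows * 100.0
    return change_pct <= allowed_pct

print(within_tolerance(1_000_000, 1_050_000))  # True: 5% change, acceptable
print(within_tolerance(1_000_000, 600_000))    # False: 40% drop, flag it
```

A richer profile would compare means, standard deviations and null rates the same way.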
With streams, it could determine which streams are trustworthy and which map to which data sets.