You’ve seen it everywhere… your application is having major problems, but your IT and application performance monitoring tools have not identified any issues. The wide range of outages impacting application performance all point to the same thing: growing problems with your data pipelines.
Consequently, data quality has become a hot topic again and new tools have started to appear. But why is this happening? Why do we need to resolve a problem that’s been around since data itself, and that already has an incumbent stack of legacy tools?
Two words: Big data.
The growth in data volume over the past 10 years has created a tectonic shift in the requirements for data quality tools — and legacy tools don’t meet them anymore.
Legacy Tools: How IDQ and Others Were Built Before Big Data
Legacy data quality tools were designed to serve a different world of data. Informatica Data Quality was released in 2001. Talend was released in 2005. Comparable tools arrived in the same window. But the world of “big data” was created by three events that arrived much later.
Event 1: The Birth of Big Data and ETL
ETL for big data began with Hadoop, which was released in 2006, but didn’t penetrate the mainstream Fortune 500 enterprise segment for another decade.
Event 2: The Birth of Cloud
Event 3: The Birth of the Cloud Data Warehouse and ELT
In short: Legacy data quality tools were created long before big data arrived, so they were never designed to solve data quality in a big data world. While they have tried to catch up, they fundamentally do not meet the unique requirements created by the 44x increase in data production between 2010 and 2020.
Fundamental Mismatch: 12 Requirements Legacy Tools Don’t Meet
Big data has made legacy tools ineffective across multiple requirements, including:
- Increased Data Volume. Legacy tools often load complete datasets before analyzing them. But big data lakes and warehouses have so much data that this approach is expensive, slow, or infeasible.
- Increased Data Cardinality. Legacy tools and manual approaches were not built to handle thousands of tables with hundreds or thousands of columns each.
- Increased Data Stochasticity. Legacy tools inspect individual data integrity violations. But this is untenable and meaningless when we have so much data volume and variety, and when one small issue can break many data elements.
- Continuous Flows of Data. Legacy tools can’t keep pace when data arrives every hour or minute and must be used right away, so issues must be detected in near-real-time to prevent damage.
- Processing Pipelines. Legacy tools use legacy definitions of data quality. But now we have automated ELT pipelines with additional failure modes that are unique to that setting and are not covered by legacy data quality definitions.
- Changing Data Shapes. Legacy tools were designed before every organization became data-driven. But now, data is entrenched deep into the product and analytics pipeline and data models evolve as the product evolves.
- Dataflow Topology/Lineage. Legacy tools were built to run checks on a single master dataset. But we now have data pipelines with a dozen stages and many branches, which adds a spatial dimension to data quality problems.
- Timeseries Problems. Legacy tools were designed to measure data quality on a single batch of data using absolute criteria. But data now flows continuously in small batches, which adds a temporal dimension to data quality problems.
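To make the stochasticity and timeseries points concrete, here is a minimal sketch of a check that judges a new batch against its own recent history instead of an absolute rule. It is an illustration only: the function names, the choice of null rate as the metric, and the z-score threshold of 3 are all assumptions, not a description of how any particular tool works.

```python
from statistics import mean, stdev


def null_rate(batch, column):
    """Fraction of rows in `batch` where `column` is missing (None)."""
    if not batch:
        return 0.0
    missing = sum(1 for row in batch if row.get(column) is None)
    return missing / len(batch)


def is_anomalous(history, current, z_threshold=3.0, min_history=5):
    """Flag `current` if it sits far from the mean of earlier batches.

    `history` holds the same metric for previous batches. The threshold
    and minimum history length are illustrative defaults (assumptions).
    """
    if len(history) < min_history:
        return False  # not enough history to judge yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Null rates of a hypothetical `user_id` column over six earlier batches.
history = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013]

# A new batch where half of the values are suddenly missing.
batch = [{"user_id": 1}, {"user_id": None}, {"user_id": None}, {"user_id": 4}]

rate = null_rate(batch, "user_id")
print(rate, is_anomalous(history, rate))
```

The point of the design is that nobody has to decide up front what an acceptable null rate is for each of thousands of columns. The check only asks whether this batch looks like the batches before it.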
We have also experienced cultural changes that created their own new requirements.
- Collaboration. Data problems and solutions now touch everyone in the org.
- Consumerization. Every org now struggles with data volume and complexity.
- APIs. Platforms now need to be dev-friendly, automatable, and interoperable.
- Laws. Platforms must build architecture for security, compliance, and privacy.
These new requirements have been quietly building over the last decade, and have suddenly begun to drive new conversations around data quality for one core reason.
The Tipping Point: Why Now Is the Time to Revisit Data Quality
After a period of heavy flux in the ETL jungle, a new and stable ELT data stack has emerged. And the centerpiece of the new stack, the data warehouse, enforces fewer data integrity checks and constraints than traditional databases.
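In practice, this means that rules a traditional database would reject at write time, such as a duplicate key or a missing required value, have to be checked separately after the data has landed. A minimal sketch of such a check, assuming the rows have already been fetched as dictionaries (the function name and the column names are hypothetical):

```python
from collections import Counter


def find_constraint_violations(rows, key, required=()):
    """Report duplicate key values and null counts for required columns.

    Mimics, after the fact, what a uniqueness constraint and NOT NULL
    constraints would have prevented at write time.
    """
    key_counts = Counter(row.get(key) for row in rows)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )
    null_counts = {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in required
    }
    return {"duplicate_keys": duplicate_keys, "null_counts": null_counts}


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},  # duplicate id and missing email
    {"id": 2, "email": "b@example.com"},
]

print(find_constraint_violations(rows, key="id", required=("email",)))
```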
At the same time that support for data quality is thinner than before, companies depend on their data more than before. Every company is now data-driven, nobody can afford bad data anymore, and the flaws in legacy tools are really starting to hurt.
In summary, it has become painfully obvious that too much has changed, that legacy tools do not work in the new world of data, and that we need to rethink the data quality problem from a clean slate.
Amazon Web Services and Snowflake are sponsors of The New Stack.