Soda Checks to Keep Your Data in Line
There’s been a lot of talk lately about data mesh, which is not a technology or service but an organizational structure that moves ownership of data closer to the people who use it to create value for the company, as Emily Omier explained in a recent post.
If you have a central data engineering group, how well do they really understand what are the data sets that finance needs? Or the data sets that any of the business units needs? The closer you are to somebody who understands the business problems and the requirements and has the domain knowledge, the better prepared they are to build the right set of data assets to power the right kind of use cases.
Belgian startup Soda is taking that data ownership a step further to enable the business data owners to also own data quality. Co-founders Tom Baeyens and Maarten Masschelein came at the problem from slightly different angles but recognized a common problem, and the company was born.
“There’s all these people working together to make some value out of the data that they have. And it turns out that in production, the biggest problem is actually to keep that data in a clean form. Because once you’re using data in production, then typically the engineers go and do something else, build the next product. And then it breaks down,” Baeyens explained.
There are myriad ways data systems can go wonky — it might be as simple as somebody adding a new field in Salesforce — but traditionally, engineers have to write code to create checks on data quality in production, something data analysts often lack the skills to do. The Soda team set out to change that, focusing on the needs of data analysts as well as the data engineers.
Data as Code
To that end, it released Soda Core, a framework for embedding data reliability checks and quality management into data pipelines powered by SodaCL (Soda Checks Language), a domain-specific language for data reliability.
Taking a page from the data-as-code concept, Soda Core is an open source CLI tool and Python library that uses SodaCL to turn user-defined input into aggregated SQL queries. Core components include dataset metadata, used to understand the shape and health of the data, and built-in metrics with broad check coverage that can validate many data quality parameters. The checks include anomaly detection and change-over-time checks to detect and resolve issues in the data and alert the appropriate people. Soda Core is the foundation for Soda Cloud but can also be used as a standalone tool.
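To make the idea of turning declared checks into aggregated SQL concrete, here is a minimal sketch in Python. It is not Soda Core's implementation and does not use SodaCL syntax: the check tuples, the metric names, the `customers` table and the helper functions are all assumptions made up for illustration, and it runs against an in-memory SQLite database from the standard library.

```python
import sqlite3

# Hypothetical check declarations: (metric, column, operator, threshold).
# They mimic the spirit of declarative checks; they are NOT SodaCL syntax.
CHECKS = [
    ("row_count", None, ">", 0),
    ("missing_count", "email", "=", 0),
    ("duplicate_count", "id", "=", 0),
]

# SQL expression template for each toy metric (illustrative names only).
METRIC_SQL = {
    "row_count": "COUNT(*)",
    "missing_count": "SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END)",
    "duplicate_count": "COUNT({col}) - COUNT(DISTINCT {col})",
}

OPERATORS = {
    ">": lambda a, b: a > b,
    "=": lambda a, b: a == b,
    "<": lambda a, b: a < b,
}


def build_query(table, checks):
    """Combine every metric into a single SELECT so the table is read once."""
    exprs = [METRIC_SQL[metric].format(col=col) for metric, col, _, _ in checks]
    return f"SELECT {', '.join(exprs)} FROM {table}"


def run_checks(conn, table, checks):
    """Return a list of (description, measured value, passed) tuples."""
    row = conn.execute(build_query(table, checks)).fetchone()
    results = []
    for (metric, col, op, threshold), value in zip(checks, row):
        name = f"{metric}({col})" if col else metric
        passed = OPERATORS[op](value, threshold)
        results.append((f"{name} {op} {threshold}", value, passed))
    return results


# Small sample table with one missing email and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

results = run_checks(conn, "customers", CHECKS)
for description, value, passed in results:
    print(f"{'PASS' if passed else 'FAIL'}  {description}  (measured {value})")
```

The point of the sketch is the shape of the approach: the person writing checks states what good data looks like, and the tooling folds all of the metrics into one query rather than issuing a separate query per check.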
In 2021, the company released Soda SQL to help data engineers maintain reliable data pipelines in production. It has since built out SodaCL as a domain-specific language, enabling data teams to check data as code across every data workload from ingestion to consumption.
As a more human-readable language, SodaCL eliminates the need to code in SQL, meaning that everyone on a data team can define the thresholds of what good data needs to look like. Underneath, the checks still run as queries against SQL-based data sources.
SodaCL includes more than 30 built-in metrics.
Said Tiago Andrade, head of big data, analytics and AI at Brazilian retailer Americanas S.A., “The modern retail environment has changed, and for organizations like Americanas to continue offering the best possible commerce experience, we are reliant on AI- and ML-powered digital engines that sit behind our retail platform.
“This platform is a dynamically changing entity which needs to be managed in real time to ensure that we’re adapting to changing conditions and not suffering from errors which impact accuracy and degrade overall performance. Soda gives us the end-to-end observability we need to be more confident about the data that is feeding our engines, meaning that instead of being reactive to issues, we can take a much more proactive approach based on an entirely accurate picture of the health of our data.”
Baeyens said Soda’s users pushed for a dedicated language for data reliability. A couple of companies had already been working on such a language.
“When you want to monitor this data in production, that means you need to build up a picture of what good data looks like, so that you can monitor for that,” he said.
“Normally, this is a terrain only reserved for engineers. They have to write code, they know how to write code, and then they have to learn the library and all that. But our focus is … expanding that also to analysts and non-technical users. So the language really allows analysts to become self-serve. They don’t have to rely on programmers anymore to write those checks. [With the language] it’s much simpler than writing code. It’s easy to read. And now a lot more people can contribute to the picture of what good data looks like.”
For instance, you can compare data sets, check the freshness of data or configure a programmatic scan to create a circuit breaker to stop the ingestion of data should a problem be detected.
A Soda Core scan takes two inputs: the data source configuration and the checks you want to run. Both are YAML configuration files.
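The circuit-breaker pattern mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Soda's own Python API: the `DATA_SOURCE` dictionary and `CHECKS` list stand in for the two YAML inputs, and the table names, `DataQualityError`, `scan` and `ingest` are hypothetical. It uses only the standard library's SQLite module.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised to halt the pipeline when at least one check fails."""


# Input 1 (hypothetical stand-in for the data source configuration file).
DATA_SOURCE = {"name": "staging", "database": ":memory:"}

# Input 2 (hypothetical stand-in for the checks file):
# (description, SQL returning one number, predicate on that number).
CHECKS = [
    ("staged_orders is not empty",
     "SELECT COUNT(*) FROM staged_orders",
     lambda n: n > 0),
    ("no negative amounts",
     "SELECT COUNT(*) FROM staged_orders WHERE amount < 0",
     lambda n: n == 0),
]


def scan(conn, checks):
    """Run every check and return a list of failure messages."""
    failures = []
    for description, sql, ok in checks:
        value = conn.execute(sql).fetchone()[0]
        if not ok(value):
            failures.append(f"{description} (measured {value})")
    return failures


def ingest(conn, rows, checks):
    """Stage rows, scan them, and publish only if every check passes."""
    conn.execute("DELETE FROM staged_orders")
    conn.executemany("INSERT INTO staged_orders VALUES (?, ?)", rows)
    failures = scan(conn, checks)
    if failures:
        # Circuit breaker: bad data never reaches the published table.
        raise DataQualityError("; ".join(failures))
    conn.execute("INSERT INTO orders SELECT * FROM staged_orders")
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(DATA_SOURCE["database"])
conn.execute("CREATE TABLE staged_orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# A clean batch is published; a batch with a negative amount is blocked.
published = ingest(conn, [(1, 9.5), (2, 20.0)], CHECKS)
try:
    ingest(conn, [(3, -4.0)], CHECKS)
    blocked = False
except DataQualityError as err:
    blocked = True
    print(f"Ingestion stopped: {err}")
```

In an orchestrated pipeline, the raised exception would fail the task that runs the scan, so downstream tasks that depend on the data would not start.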
“It’s very easy for engineers to plug in into their Airflow or orchestration tools, very early on as data comes in,” Masschelein said.
The company’s commercial offering is a managed cloud service that includes collaboration tools, incident management, integrations with Slack and other features.
Baeyens, the company’s CTO, previously created the open source projects jBPM, a JBoss-based toolkit for building business applications to help automate business processes; and Activiti, a Java-centric business process model and notation (BPMN) engine for process automation. He also created Effektif, a cloud-based business process management (BPM) solution for process automation that became SAP Signavio Process Governance.
Masschelein, the CEO, came from data governance platform vendor Collibra, which was using Baeyens’s data tools. The two connected on a community forum, and Soda launched nearly four years ago. The Brussels-based company has grown to around 40 employees.
It counts Disney, HelloFresh, Udemy and St. Jude Children’s Research Hospital among its users and open source contributors.
Disney, for instance, contributed connectors to the Trino SQL query engine, and HelloFresh is working with the company on Spark.
“So you can use this on data frames, which is also very popular,” Masschelein said. “And then in the future, we will go in the direction of streaming as well. We’ve done some early prototyping. But we want to make sure we cover the entire landscape from streaming to Spark to all SQL sources.”