It’s Time for Data Reliability Engineering
There was a point when software was pretty unreliable. In this 20-year-old MIT Technology Review article, one software engineer laments that good software “is usable, reliable, defect-free, cost-effective, and maintainable. And software now is none of those things.” Fast forward two decades and businesses run on software, from payments to CRM and everything in between.
As data transitions from a nice-to-have to something that businesses rely on for creating customer experiences and driving revenue, data needs to undergo a similar evolution, and there are lessons to be learned from the work already done by the software engineering pioneers who came before.
Borrowing from the principles of Site Reliability Engineering, Data Reliability Engineering gives a name to the work of improving data quality, keeping data moving on time, and ensuring that analytics and machine learning products are fed with a healthy set of inputs.
This work is done by data engineers, data scientists, and analytics engineers who historically have not had the mature tools and processes that modern software engineering and DevOps teams already enjoy. As a result, data reliability work today usually involves spot-checking data, kicking off late-night backfills, and hand-rolling SQL-into-Grafana dashboards rather than scalable, repeatable processes like systematic monitoring and incident management.
Under the name Data Reliability Engineering (DRE), some data teams are starting to change that by borrowing from SRE and DevOps.
Why Is This Happening Now?
Data quality more broadly has been a topic for decades but has gotten markedly more attention within the last two years. This is driven by a few trends coming together at the same time.
- Data is being used in ever-higher impact applications: Support chatbots, product recommendations, inventory management, financial planning, and much more. These data-driven applications promise big gains in efficiency, but they can also incur costs to the business if there’s a data outage. As companies push for higher and higher ROI use cases, there’s more riding on the data, which increases the demand for quality and reliability.
- Humans are less in the loop: Streaming data, machine learning models that retrain on regular schedules, self-service dashboards, and other applications reduce the number of humans in the loop. This means the pipelines have to be more reliable by default, because there isn’t an analyst or data scientist spot-checking the data anymore. And there shouldn’t be: they have work to do!
- There aren’t enough data engineers to go around: Hiring data engineers is difficult and expensive. The demand for talent is exploding while the supply of people who can build and scale complex data platforms hasn’t kept up. This puts immense pressure on these teams to be resource-effective and to avoid reactive firefighting. That makes anything that automates problem detection and resolution valuable, especially tools or practices that help prevent problems in the first place.
Where Does DRE End and DataOps Begin?
Data Reliability Engineering is a part of Data Operations (DataOps), but only a part. DataOps refers to the broader set of all operational challenges that data platform owners will face. These challenges cover problems like data discovery and governance, cost tracking and management, access controls, and how to manage an ever-growing number of queries, dashboards, and ML features and models.
To draw a parallel with DevOps, reliability and uptime are certainly challenges that many DevOps teams are responsible for, but they’re often also charged with other aspects like developer velocity and security considerations.
The Tools and Techniques of DRE
While the ink hasn’t dried on the best tools and practices for Data Reliability Engineering, the seven core concepts from Google’s SRE Handbook create a strong foundation for data teams to work from.
- Embrace risk: It’s an unavoidable fact that something will eventually fail. Teams need to plan to detect, control and mitigate failures that do occur, rather than hope that they can someday achieve perfection.
- Monitor everything: Problems can’t be controlled and mitigated if they can’t be detected. Monitoring and alerting give teams the visibility they need to understand when something is wrong and how to fix it. Observability tooling is a mature area for infrastructure and applications, but for data, it’s still an emerging space.
- Set standards: Is the data high quality or not? That’s a subjective question that needs to be defined, quantified, and agreed upon in order for teams to make progress on it. If the definition of good or not good is fuzzy or lacks alignment, it will be hard to do anything about it. SLIs, SLOs, and SLAs are the standards-setting tools that can be adapted from SRE-land into DRE-land.
- Reduce toil: “Toil” describes the human work needed to operate your system (the operational work), as opposed to the engineering work that improves the system. Examples include manually kicking off an Airflow job or updating a schema by hand. For effective Data Reliability Engineering, it’s worthwhile to remove as much toil as possible to reduce overhead. For example, tools like Fivetran can reduce the toil in ingesting data, and Looker training sessions can reduce the toil of responding to BI requests.
- Use automation: Data platform complexity has grown rapidly, while a team’s capacity to manage it manually grows only linearly with headcount, which is expensive and untenable. Automating manual processes helps data teams scale reliability efforts, freeing up brainpower and time for tackling higher-order problems.
- Control releases: Making changes is ultimately how things improve, but also how things break. Here data teams can borrow practices pretty directly from SRE and DevOps, namely code review and CI/CD pipelines. After all, pipeline code is still code at the end of the day.
- Maintain simplicity: The enemy of reliability is complexity. Complexity can’t be completely eliminated (the pipeline is doing something to the data, after all), but it can be reduced. Minimizing and isolating the complexity in any one pipeline job goes a long way toward keeping it reliable.
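To make the “Set standards” and “Embrace risk” ideas above a little more concrete, here is a minimal Python sketch of a freshness SLI measured against an SLO, along with the error budget that is left over. Everything in it is an illustrative assumption rather than something prescribed by the SRE principles or by any particular tool: the FreshnessSLO class, the function names, the one-hour lag limit, the 50% target, and the sample checks are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FreshnessSLO:
    """Hypothetical SLO: a table should have been loaded within `max_lag`
    in at least `target` (a fraction between 0 and 1) of all checks."""
    max_lag: timedelta
    target: float


def freshness_sli(checks, max_lag):
    """SLI: the fraction of checks in which the table was fresh enough.

    `checks` is a list of (checked_at, last_loaded_at) datetime pairs.
    """
    if not checks:
        raise ValueError("need at least one check to compute an SLI")
    good = sum(
        1 for checked_at, loaded_at in checks
        if checked_at - loaded_at <= max_lag
    )
    return good / len(checks)


def error_budget_remaining(sli, target):
    """Share of the error budget still unspent (negative means overspent)."""
    allowed_bad = 1.0 - target
    if allowed_bad <= 0:
        raise ValueError("a 100% target leaves no error budget")
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad


# Made-up sample data: four hourly checks, one of which found stale data.
slo = FreshnessSLO(max_lag=timedelta(hours=1), target=0.5)
base = datetime(2024, 1, 1)
lags_minutes = [10, 20, 90, 30]  # how old the newest load was at each check
checks = [
    (base + timedelta(hours=i), base + timedelta(hours=i, minutes=-lag))
    for i, lag in enumerate(lags_minutes)
]

sli = freshness_sli(checks, slo.max_lag)
budget = error_budget_remaining(sli, slo.target)
print(f"SLI={sli:.2f} target={slo.target:.2f} budget left={budget:.0%}")
```

The point of the error budget is that the team agrees up front on how much staleness is acceptable, so an alert or a release freeze can be tied to the budget running out rather than to every individual late load.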
The Future of DRE
Data Reliability Engineering is a very young concept and numerous companies are helping to define the tools and practices that will make DRE as effective as SRE and DevOps. If you’re interested in exploring the concepts, the Data Reliability Engineering Conference is a good place to start. The first event took place in December with more than a thousand attendees and speakers from across the industry, including Looker, dbt, Figma, Datadog, and Netflix.