Why a DataOps Team Needs a Database Reliability Engineer

To borrow from Eminem, we forgot about DRE. The data reliability engineer — often called the site reliability engineer (SRE) for data or database reliability engineer (DBRE) — could be the missing role needed to create clarity in the ever-more complicated stack.
In fact, both of GitHub’s recent outages were caused by database issues. “Usually you read about outages and there’s a lot of things, seven different contributing factors, architectural shifts over time,” said Donnie Berkholz, Percona’s senior vice president of product management.
But this time, Berkholz told The New Stack, the total of 289 minutes of downtime of some GitHub services was due to two questions left unanswered: How do you design the architecture and database overall? How do you design well-optimized queries?
Since GitHub has adopted blameless post-mortems, we don’t — and shouldn’t — know who’s to blame. But two things can help prevent this entire class of database issues in the future, Berkholz said:
- Help developers write better database queries
- Help database designers design simpler databases
This is where the role of SRE for data (DRE, DBRE, or whatever you want to call it) comes in: to nurture a data-driven mindset and technology, to apply the principles of site reliability engineering, to educate developers about the database layer of the stack, and to abstract away its complexity so more people can access it.
Interest in developing an SRE-like role for databases is growing. The next Data Reliability Engineering Conference will be held, virtually, in April. The first one, in December 2021, drew participants from companies including Datadog, Figma and Netflix.
Here’s how some experts believe the job description for DREs should be refined.
What’s the Gap a Data SRE Could Fill?
“The way we look at data is changing,” Dom Couldwell, head of field engineering for DataStax across EMEA, told The New Stack. “Historically we hoarded it: ‘It might be useful one day.’ More and more of the emphasis is now on the value of data — also [on] the idea that data needs to be real time.”
And, Couldwell noted, user expectations “keep going up and up and up.”
In response, how data is managed has to change. Data must be managed in a way that can transition to the cloud, while expecting that it will likely remain on-premises, as reportedly only one in five workloads currently runs in the cloud.
“The perception of data is changing. It’s not just about storage,” Couldwell continued, because “now it’s a computable, usable item,” ideally benefiting all departments.
He reflected on his previous work life in API management, which aims to help make APIs more usable and developer friendly, by seeking a more consistent way of accessing services or data.
This is the next stage he envisions for database management: “You can’t just give people data. Developers want a friendly way to figure it out.”
He said this must reflect today’s expectations of data. In addition to being real time, those expectations include that data:
- Be ready for machine learning algorithms.
- Fuel citizen analysts.
- Offer universal access, like Data as a Service.
- Commingle with other data easily.
“We want data flowing as a river through an organization, not having these fetid pools of data all over,” Couldwell said.
The database administrator (DBA) role is known as a high-stress job in tech. Couldwell argued that the demands of the SRE role are better suited to modern data requirements of availability and uptime. A DRE, he said, should sit above the DBA in a company org chart, and under the chief data officer (CDO), in order to:
- Cultivate a data-driven culture.
- Link technology and business goals.
- Focus on cross-organizational data accessibility.
- Referee data access.
- Prioritize data governance.
- Think about data catalogs and categorization.
- Teach others to understand what data they are storing or handling.
- Make use of data visualization.
- Offer context on the meaning of data for end users.
- Think about how to get data to the edge.
- Work with universal programming languages, not just R.
- Consider data at rest versus data in motion.
“The gap between a CDO and DBA seems wide — you need something in between,” Couldwell said. “A product owner for data, who thinks about data as a product: How is it controlled? How has it evolved? What does it mean over time? It’s not how much petabytes we are storing, but how much petabytes people are using.”
Yes, this is advocating for a new role at a time when tech staffing budgets are being sliced. But he argued this can help companies do more with less, while increasing uptime and reducing single points of failure. By doing this the right way, Couldwell noted, organizations “may democratize data instead of creating more specialist roles.”
A Data Reliability Engineer’s Role in DataOps
The site reliability engineer’s role isn’t just about uptime, reliability and security; it’s also about educating others in the organization about what systems need to run well. It’s the same with the DBRE, whose goal is to create citizen data analysts and to empower developers to work more closely with databases.
The DBRE, as Percona’s Berkholz calls it, should be a customer-facing role, or at least one that protects the “internal crown jewels of customer data, of everything we know about them.”
To achieve this, he advocates for a T-shaped DataOps team: “You want to have a team that has a broad set of expertise, that brings in a lot of expertise from many disciplines. It’s impossible to be the full stack expert in everything.”
This T-shaped data-driven team, he continues, needs to have enough skills to solve most of the day-to-day troubleshooting, like:
- Resolving minor incidents.
- Making minor configuration changes.
- Speeding up a slow-performing query.
- Adding an index to the database.
- Redesigning the query’s structure so it runs faster.
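As a rough illustration of the "slow query, add an index" items in the list above, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns and the index name are hypothetical and chosen only for the example; the article does not describe any particular schema or database engine. The sketch asks SQLite for its query plan before and after the index is created, which is one simple way a team member could check whether a fix actually changed how the query runs.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(10_000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = ?"


def query_plan(connection, sql, params):
    """Return the 'detail' text of each row of SQLite's EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, QUERY, (42,))

# The "minor change": add an index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index in place, the plan should switch to an index search.
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, so the useful signal is the shift from a scan of the table to a search that names the index. Other engines expose similar information through their own EXPLAIN commands.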
“A data engineer is not a database expert either, so they may write queries very inefficiently,” Berkholz said, which is why storage and compute scaling should be automated.
But teams need their own expertise to design a database, and developers and SREs are currently not getting the database skills they need, he said, which is why a database reliability engineer would sit on that T-shaped team.
This calls for prioritizing database education and, he contends, bringing in an external partner to train the first generation of DBREs — like what happened at the start of Kubernetes adoption. It takes an organization-wide commitment to upskilling and mentoring.
The ideal data reliability engineering stack, he argued, would provide developers with database self-service that has guardrails set by the DBA, including General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) limitations on where data can live.
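One way to picture self-service with guardrails is a small policy check that runs before a developer's database request is provisioned. The sketch below is an assumption-laden illustration, not a description of any real product: the data classifications, region names, `DatabaseRequest` type and `check_request` function are all hypothetical, and an actual residency policy would come from legal and governance teams rather than from a hard-coded table.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical policy table: regions in which each class of data may be stored.
# Classifications and region names are illustrative assumptions only.
ALLOWED_REGIONS = {
    "eu_personal": {"eu-west-1", "eu-central-1"},
    "california_personal": {"us-west-1", "us-west-2"},
    "non_personal": {"eu-west-1", "eu-central-1", "us-west-1", "us-west-2", "us-east-1"},
}


@dataclass(frozen=True)
class DatabaseRequest:
    """A developer's self-service request for a new database (hypothetical shape)."""
    team: str
    data_class: str
    region: str


def check_request(request: DatabaseRequest) -> Tuple[bool, str]:
    """Apply the guardrail: approve only if the region is allowed for the data class."""
    allowed = ALLOWED_REGIONS.get(request.data_class)
    if allowed is None:
        return False, f"unknown data classification: {request.data_class}"
    if request.region not in allowed:
        return False, f"{request.data_class} data may not be placed in {request.region}"
    return True, "approved"


if __name__ == "__main__":
    for req in (
        DatabaseRequest("checkout", "eu_personal", "eu-west-1"),
        DatabaseRequest("checkout", "eu_personal", "us-east-1"),
    ):
        print(req, "->", check_request(req))
```

The design choice here is that developers never see the policy table itself; they submit a request and get an approval or a plain-language reason for refusal, while the DBA or DBRE owns the rules.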
But software itself won’t be enough without more knowledge about databases, Berkholz cautioned. “No matter where your application runs, you’re still responsible for the performance and availability of that application,” he said, paraphrasing Charity Majors, co-founder of Honeycomb.io.
“Software can help but you still need to maintain an understanding of how you get a highly reliable and available application. Software just helps with usability, but then you do the right or wrong thing.”
Getting Developers on Board
The story of Kubernetes and that of databases in the cloud native era parallel each other, according to Berkholz and other experts. A DBRE should be focused on “building for cloud native from a database perspective,” he said. And the key question that should face them is, “should you abstract or do you want to retain control?”
Everyone in an organization needs to be able to look down and understand how to fix basic problems with the database, to understand how it works, but they also need much of the complexity abstracted out, because data is too important to be relegated to a small group of people in an organization.
Developers must be among the users considered when databases and database management tooling are built. A database is an opportunity to further connect developers to the business logic. It’s the database reliability engineer’s job to design databases that are clear to a developer audience, too.
Before Ovais Tariq co-founded Tigris Data, he headed up storage infrastructure at Uber, where he worked with a lot of full-stack developers fresh out of boot camps.
They arrived with a grasp of both frontend and backend, but backend in an e-commerce application, he told The New Stack, “can mean many things. Could be backend of mobile app like rider profiles, how you match drivers, how you deploy machine learning models to figure out right pickup time, [and then for] UberEats, what food is good to promote at what time of day.”
The piece of stack knowledge they were still missing was an understanding of the database. There needs to be a way, Tariq contended, to abstract away the data science programming languages that developers wouldn’t otherwise need to understand, while still letting them understand the database and even make routine fixes.
Infrastructure tools have to evolve to be more accessible to developers, he said. “Application developers as a whole do not have the knowledge to be able to use databases the right way.” And, again, this is where the emergent DRE role can help bridge the gap and build databases that are easier for devs to use.
Where Do Ethics Fit into Data SRE?
We’ve moved from “data is the new oil” to “data is a river delta,” flowing from place to place, sometimes spilling over, not always knowing what came from where. Jennifer Prendki, CEO and founder of Alectio, a DataPrepOps company, shared some thoughts with The New Stack about how ethics, compliance and governance need to fold into this new SRE-for-data job description.
“Ten years ago as a data scientist, if your model wasn’t working, you’d dump more data, not paying attention to the reliability of the information, the credibility of the author — and, even if the information is credible, there is a lack of giving credit to the right person or attribution,” she said. “For the longest time, the industry didn’t care about this.”
And then came Cambridge Analytica and its algorithmic influence over the Brexit vote in the U.K. “It didn’t change anything, no one was responsible,” Prendki said. Now there’s the GitHub Copilot controversy, she added, where “people who write a little piece of code to a repo are shocked their data is being used for training.”
In over a decade of Big Data, regulation has been slow to emerge, and so most companies still hoard and share data at a great financial, environmental and privacy risk. “The whole market has been racing to get as much data as possible, because there is a Big Data lobby out there,” Prendki said.
As more lawsuits around generative artificial intelligence pop up, alongside whopping GDPR fines, organizations are starting to show interest in best practices and legal rules but lack the expertise.
“People in this world don’t have the level of technical competency because other organizations use people’s data as well,” Prendki said. “You’re going to need SREs and you’re going to need more than that — SREs to advise legal teams and governance, best practices about when to use [which data], limits for which data should be used.”
She sees this new SRE-like data role as an extension of the data governance expert, perhaps within organizations, though it could also be filled by external auditors: “If you’re going to use the world’s data, I don’t necessarily think a person within the company is the right person to make the decision.”
As the role of database reliability engineer evolves, it may vary as much as the data reliability needs of each organization. We look forward to telling more stories about how this role develops.