The Basics of DataOps and Why It Matters
GitLab sponsored this podcast.
The practice of DataOps has emerged as another variant of DevOps. It involves how data is managed. But DataOps remains a work in progress for many organizations.
For this latest episode of The New Stack Makers podcast, Alex Williams, founder and publisher of The New Stack, talks with guest speakers Dina Graves Portman, developer relations engineer for Google; Emilie Schario, internal strategy consultant, data, for GitLab; and Nicole Schultz, assistant director of engineering for Northwestern Mutual. They discussed how DataOps is defined and why its application is particularly relevant in today’s highly complex and increasingly distributed environments.
Like DevOps, DataOps can be described as workflow-related but helps resolve data-management challenges. “I think of DataOps as creating a workflow or a way of working as a data team that creates efficiencies and allows you to get more done with the resources you have in a better, more stable way,” said Schario.
DataOps also helps to solve many issues IT teams struggle with related to data. Sadly, broken dashboards and incorrect tallies “seem to be the norm everywhere,” Schario explained.
“DataOps is really about catching those problems — it’s about creating a workflow that catches problems before they make it to your end user,” said Schario. “I think people are feeling the pain.”
A motto of Google site reliability engineers is that “hope is not a strategy,” said Portman. “DataOps is taking ‘hope is not a strategy’ and applying that to data and using all of the different tools that software engineers have been using for years.”
Automating data-collection processes is also important for implementing DataOps, Schultz explained.
“By using CI/CD DevOps practices, [IT teams] are automating more so you’re not reliant on one person doing one certain thing, while still ensuring that you have high data quality with the data-collection processes that you’re automating as you’re moving data throughout workflows,” said Schultz.
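The kind of automated quality gate Schultz describes can be illustrated with a small sketch: a data-quality check that a CI/CD pipeline might run on every change to a data workflow. The `check_quality` function, the rule set, and the sample rows are all hypothetical, invented here for illustration.

```python
# Hypothetical data-quality check of the kind a CI/CD pipeline could
# run automatically, instead of relying on one person to eyeball the data.
def check_quality(rows):
    """Return a list of human-readable problems found in the data."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("clicks") is None:
            problems.append(f"row {i}: missing clicks")
        elif row["clicks"] < 0:
            problems.append(f"row {i}: negative clicks")
    return problems

# Sample rows: one clean, one corrupted, one incomplete.
sample = [{"clicks": 10}, {"clicks": -3}, {"clicks": None}]
print(check_quality(sample))
```

In a pipeline, a non-empty result would fail the build, stopping bad data before it reaches a dashboard.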
Version control should play a role in helping to maintain DataOps discipline just as code repositories have served the modern, distributed development process.
“It really starts with version control to solve the ‘single source of truth’ problem because you create your repository where you’re storing your code, which is the place where your code lives,” said Schario.
Working on data-related processes in a “version control environment” means “everyone’s working off a single collaborative code repository,” Schario explained. “That’s the first step in making sure everyone has the same backbone to the work they’re doing.”
DataOps should also help usher in a new era in data and database management, such as the application of new technologies across cloud native environments. As a subset of automation, unsupervised machine learning applied to data analysis and management should reveal some amazing applications in the not-so-distant future.
Portman explained how unsupervised machine learning can help manage systems over which DataOps teams might not necessarily have direct control. If a field stores counts such as views or clicks as a 32-bit signed integer, the field can eventually overflow, causing the numbers to look negative. Unsupervised learning can separate the dataset into unlabeled groups, so running it regularly on datasets could bring such issues with the data to light as they develop.
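The overflow failure mode Portman describes can be reproduced in a few lines. This is a minimal sketch with invented click counts: NumPy's `int32` arithmetic wraps around at the 32-bit limit, so totals that cross it suddenly turn negative.

```python
import numpy as np

# Largest value a 32-bit signed integer can hold: 2**31 - 1.
INT32_MAX = np.iinfo(np.int32).max

# Hypothetical running click totals that have crept up to the limit.
clicks = np.array([INT32_MAX - 2, INT32_MAX - 1, INT32_MAX], dtype=np.int32)

# A few more clicks push each total past the limit; int32 arithmetic
# wraps around, so the counts abruptly look negative.
clicks = clicks + np.int32(5)
print(clicks)  # large negative values instead of ~2.1 billion
```

A scheduled check as simple as `(clicks < 0).any()` would surface the corruption the day it begins, rather than when a user notices a negative view count.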
“How do you identify when something goes wrong when there’s an anomaly? I think unsupervised learning would be a really great way of understanding that when our data usually looks like ‘x’ and suddenly there is this part of it that looks like ‘y,’” said Portman. “That would be a very useful thing to do.”
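One way to sketch the "usually looks like x, suddenly part looks like y" idea is with scikit-learn's `IsolationForest`, a common unsupervised anomaly detector (chosen here for illustration; the podcast does not name a specific algorithm, and the view counts below are invented).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# "Our data usually looks like x": daily view counts in a normal range.
normal = rng.normal(loc=50_000, scale=2_000, size=(200, 1))

# "Suddenly part of it looks like y": overflowed counts gone negative.
anomalous = np.full((5, 1), -2_147_483_648, dtype=float)

data = np.vstack([normal, anomalous])

# Unsupervised: no labels, the model just learns what "usual" looks like
# and scores each point by how much it deviates.
model = IsolationForest(contamination=0.05, random_state=0).fit(data)
flags = model.predict(data)  # -1 marks anomalies, 1 marks normal points

print((flags[-5:] == -1).all())  # were the overflowed rows flagged?
```

Run on a schedule, a detector like this flags the anomalous slice of the data without anyone having to define "broken" in advance.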