Taking Data Curation to a New Level
Today’s digital world has changed modern enterprises’ relationship with data, presenting organizations with immense promise alongside an evolving slate of challenges. Data-driven decision-making is becoming the norm for addressing those challenges quickly and effectively, before they become insurmountable, and the world of data has reached an inflection point that will only deepen the bond between enterprises and their data.
In the not-too-distant past, data users received packaged data reports to analyze and absorb, which soon evolved into a generation of self-service. Now, data is distributed across an organization, and companies prioritize data-sharing organization-wide. Today’s leaders empower data and business users across their organizations to discover and explore trusted data — weaving it into everything they do.
Common Data Challenges
The challenge? Data is stored in siloed environments, from the cloud to on-premises, making it difficult for producers of data products to understand what exists. The immense amount of data can create bottlenecks and confusion. Furthermore, data is spread across wiki pages, data dictionaries, email, chat, social and raw web content. These high volumes of data often come with redundancies, making it hard for data consumers to fully understand the breadth of what data exists. This hinders a user’s ability to find a single source of truth.
The second challenge is data understandability, context and trustworthiness. Very little data within the massive enterprise data sprawl has a description, associated owner, date of creation, intended use/purpose, usage stats, data quality indicators, etc. Finding data is only a fraction of the effort; the rest is piecing together the answers to all these contextual questions, which is currently done manually through brute force, phoning friends and making educated guesses. Understandability, context and trustworthiness are challenges shared by all data consumers, regardless of whether they are doing self-service, using delivered reports or using shared data products.
Lastly, organizations struggle to capture, associate and share critical knowledge about the data. This knowledge, i.e., common term definitions, metric definitions, KPI definitions and business process/workflow definitions, must be shared to increase data literacy and to use data effectively across the enterprise.
Enter Data Curation
Let’s revisit the definition of curation that Dave Wells, noted writer and expert, offered in his 2019 blog. Wells stated that “curation is the work of organizing and managing a collection of things to meet the needs and interests of a specific group of people.” At the time, “things” was primarily thought of as datasets and files. Now the scope has broadened with the realization that asset classes that need to be curated also include BI reports/dashboards, models, metrics, terms, glossaries, domains and much more.
While this covers the “things” to be curated, how are they organized and managed? Maintaining metadata attributes for each asset class helps to keep things organized. For instance, curated metadata attributes for a column might include a description, top users, associated queries, data quality score, last updated date and sensitivity classification. Curated metadata attributes for a BI report may include creation date, report owner, certification classification, update date, delivery group/list, run schedule, etc.
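As a minimal sketch of the idea, the per-asset-class attributes can be written down as a simple registry and checked against an asset's actual metadata. The attribute names follow the examples in the text; the registry `CURATED_ATTRIBUTES`, the helper `missing_attributes` and the sample values are hypothetical and are not taken from any particular catalog product.

```python
# Hypothetical registry: asset class -> curated metadata attributes.
# Names mirror the examples in the text; the structure is an illustration only.
CURATED_ATTRIBUTES = {
    "column": [
        "description", "top_users", "associated_queries",
        "data_quality_score", "last_updated_date", "sensitivity_classification",
    ],
    "bi_report": [
        "creation_date", "report_owner", "certification_classification",
        "update_date", "delivery_group", "run_schedule",
    ],
}


def missing_attributes(asset_class, metadata):
    """Return the curated attributes that are absent or empty for one asset."""
    expected = CURATED_ATTRIBUTES[asset_class]
    return [name for name in expected if metadata.get(name) in (None, "", [])]


# Made-up metadata for a single column.
column_metadata = {
    "description": "Customer email address",
    "top_users": ["alice", "bob"],
    "data_quality_score": 0.92,
    "sensitivity_classification": "PII",
}

print(missing_attributes("column", column_metadata))
```

Keeping the attribute list in one place makes the curation scope explicit: adding an asset class or an attribute is a one-line change.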
So the first step in modern curation is deciding which asset classes will be curated and what specific attributes will be curated.
The second step is to decide how you will measure the completeness, timeliness and accuracy of the curation efforts. This means you must look at each asset class and its attributes to decide what is an acceptable threshold for each.
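One way to make such thresholds concrete is sketched below, for completeness only; timeliness and accuracy would need their own measures. The function names, the threshold values and the sample columns are all assumptions made for illustration.

```python
def completeness(assets, attribute):
    """Fraction of assets that have a non-empty value for the attribute."""
    if not assets:
        return 0.0
    filled = sum(1 for asset in assets if asset.get(attribute) not in (None, "", []))
    return filled / len(assets)


def below_threshold(assets, thresholds):
    """Map each attribute whose completeness is under its target to the measured value."""
    report = {}
    for attribute, target in thresholds.items():
        score = completeness(assets, attribute)
        if score < target:
            report[attribute] = score
    return report


# Hypothetical targets: 90% of columns described, 100% classified.
COLUMN_THRESHOLDS = {"description": 0.9, "sensitivity_classification": 1.0}

# Made-up sample of four columns; one has an empty description.
columns = [
    {"description": "Order total in USD", "sensitivity_classification": "public"},
    {"description": "", "sensitivity_classification": "PII"},
    {"description": "Customer id", "sensitivity_classification": "internal"},
    {"description": "Ship date", "sensitivity_classification": "public"},
]

print(below_threshold(columns, COLUMN_THRESHOLDS))
```

In this sample, three of the four columns have a description (0.75, below the 0.9 target), while every column has a sensitivity classification, so only the description attribute is reported.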
The third step is to decide who will be responsible for overseeing the curation process. This leads to an interesting set of decisions regarding stewardship, types of stewards, data curators and crowdsourcing. In general, the metadata attributes of each asset class fall into one of three categories: business, technical or compliance related. Understanding that is a good starting point for constructing a RACI-style matrix (a responsibility assignment matrix) which shows who will be responsible and accountable for maintaining the curated attributes at or above the target threshold.
Construction of the RACI matrix also allows an organization to think deeply about which attributes can be populated and updated by their data consumer community and which need to be more closely managed by authorized users.
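A RACI-style matrix of this kind can be sketched as a small lookup table. The example below records only the responsible and accountable roles, since those are the two the text calls out, plus a flag for attributes open to the data consumer community. The role names, category assignments and the `can_edit` rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaciEntry:
    category: str       # "business", "technical" or "compliance"
    responsible: str    # role that maintains the attribute
    accountable: str    # role answerable for keeping it at or above target
    crowdsourced: bool  # True if any data consumer may populate or update it


# Hypothetical assignments for three column attributes.
RACI = {
    "description": RaciEntry(
        "business", "data_consumer_community", "business_steward", True),
    "data_quality_score": RaciEntry(
        "technical", "data_engineer", "technical_steward", False),
    "sensitivity_classification": RaciEntry(
        "compliance", "compliance_steward", "compliance_officer", False),
}


def can_edit(role, attribute):
    """Crowdsourced attributes are open to everyone; others only to R and A roles."""
    entry = RACI[attribute]
    if entry.crowdsourced:
        return True
    return role in (entry.responsible, entry.accountable)


print(can_edit("analyst", "description"))
print(can_edit("analyst", "sensitivity_classification"))
```

Here an analyst may edit a description, which is crowdsourced, but not a sensitivity classification, which is reserved for the compliance roles.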
Challenges of Successful Data Curation
Just as data challenges push many companies toward curation, curation itself brings challenges that must be overcome. As discussed above, getting a handle on metadata as its volume rises is a significant hurdle. Additionally, the internal pace of change as a company moves toward becoming a data-forward organization will affect the success of the data curation journey. Especially in times of transition, it is important to prioritize which data sets and areas of the business will implement data curation first.
The breadth and depth of assets that need to be curated also present challenges for successful data curation. Not only is an immense amount of metadata coming into your organization that needs to be organized, but terms, metrics, queries and more have to be derived and analyzed from all of that metadata. Companies must also consider the depth of those attributes and that each piece needs to be specifically curated to meet varying demands across different sectors within the business.
Last, a major challenge of successful data curation is that companies need to be more thoughtful about their implementation approach. Companies want to see the benefits quickly. However, they need to think about implementing new processes without disrupting day-to-day business and without making employees feel like they are gaining extra work. The platform must be easy to use and digest to help employees move fast without disrupting work.
Successful Data Curation Requires Bringing Humans and Machines Together
Understanding data is extremely important. But to know what to trust, you have to understand the intricacies of the data: how it maps to business processes, how recent it is and how it is being used. This requires a balance of machine learning (ML) and human intelligence. Companies can adopt a few key tactics to incorporate ML and AI products into their data curation process.
The first tactic is to automate curation assistance to handle the volume of work. This can help with classification, naming, tracking the popularity and top users of different data sets, and prioritizing the correct datasets. Additionally, you can use automated stewardship to spread the workload and scale across the entire enterprise.
A successful marriage between the human aspect and ML can help ease the overall transition into a healthy curation management process. With the right mix, you can crowdsource the capture of curation as a natural part of asset reuse and collaboration, which can help people do their current jobs faster.
AI and ML can be extremely helpful with tracking and governance throughout the entire organization. Automating measurement monitoring and task routing helps to keep track of data and who is making changes to it. Additionally, curation process automation via bots can help governance by triggering alerts and notifications to keep employees looped into anything out of place in their datasets.
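As a rough illustration of measurement monitoring and task routing, the sketch below turns measured scores that fall under target into alerts addressed to the relevant steward. It is not how any specific product or bot works; `route_alerts`, the scores, the targets and the steward mapping are all hypothetical.

```python
def route_alerts(measurements, thresholds, stewards):
    """Return one alert per attribute measured below its target, addressed to its steward."""
    alerts = []
    for attribute, score in measurements.items():
        target = thresholds.get(attribute)
        if target is not None and score < target:
            alerts.append({
                "to": stewards.get(attribute, "unassigned"),  # fall back if no steward is set
                "attribute": attribute,
                "score": score,
                "target": target,
            })
    return alerts


# Made-up measurements, targets and steward assignments.
measurements = {"description": 0.75, "sensitivity_classification": 1.0, "report_owner": 0.5}
thresholds = {"description": 0.9, "sensitivity_classification": 1.0, "report_owner": 0.8}
stewards = {"description": "business_steward"}

for alert in route_alerts(measurements, thresholds, stewards):
    print(alert)
```

Attributes without an assigned steward are routed to "unassigned" rather than dropped, so that gaps in ownership surface alongside gaps in curation.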
Curating Data with a Data Catalog
Data catalogs organize all data assets spread across a company’s various systems. A data catalog documents tribal knowledge and best practices by presenting the data in context. The key question is, what do you use to manage the curation process and all these data and data-related assets?
For instance, with the help of Alation Data Catalog, data curators can create a broader awareness of how data can be applied to make informed decisions, improving the accuracy of data knowledge in the organization.
By inventorying, classifying, and curating data and knowledge, Alation provides unparalleled visibility into enterprise data assets. In contrast to a time-consuming top-down, siloed approach, Alation enables organizations to focus their governance efforts on the most critical data assets to have the most significant effect on the business. A data catalog’s value stems from its ability to surface the connections and context around different datasets, and Alation Data Catalog allows for connections to almost all data connectors.
The context added by the Alation Data Catalog supports employees’ productivity and increased efficiency. Alation’s platform allows employees to see a dataset’s entire work history and gives them the ability to include notes right next to the data. It also includes automated expert identification, which pinpoints the steward responsible for a dataset and ensures that questions are routed to the appropriate person. Those questions and full conversations are also saved alongside the data so that future employees can quickly search and understand the dataset.
Another key benefit of Alation Data Catalog is its ability to help companies improve both data quality and governance. Quality is the more straightforward of the two, as the platform automatically scans data and flags potential discrepancies. On the governance side, the Alation Data Governance app centrally automates and manages policies, workflow, stewardship activities and more. The app is complemented by a professional service offering called an Active Data Governance Blueprint.
The importance of data-driven decision-making can’t be overstated. As companies continue to grow, data curators have the task of finding a needle in a haystack. Implementing these best practices and working with a partner like Alation can help bring order to data. Ultimately, investing in data knowledge with a data catalog can inspire data-savvy individuals and promote a data-driven culture where data is the bond that drives business outcomes.