How Do You Weave a Data Fabric?
With the explosion of both structured and unstructured data that followed smartphones and IoT devices comes the need to work with massive amounts of data, mine it and make it accessible. Enter data fabrics, a way to help make sense of the terabytes, then petabytes, now exabytes of data that are passing through cloud systems.
Two years ago, The Economist proclaimed that data is the new oil — the world’s most valuable resource. There’s gold in those databases, one just has to find it, extract it and refine it to distill its value.
Data fabrics can provide a catalog of consistent data services across private and public clouds. To explore data fabrics (what they are, why they are needed and how to make your own), I talked to three leading experts in the field: Anthony Lye, senior vice president and general manager of the cloud business unit at NetApp; Jack Norris, senior vice president of data and applications at MapR Technologies; and Isabelle Nuage, director of product marketing at Talend.
All these companies provide data fabric as a service.
What Is a Data Fabric?
Unlike a lot of technical terms, data fabrics have a consistent definition across the three companies that provide Data Fabrics as a Service (DFaaS). NetApp’s Kurian coined the term “data fabric” about five years ago as they were building their in-house project, according to Lye. People need data to be protected, secure, integrated, orchestrated and costed, he said.
A data fabric, said Nuage, “is a single, unified platform for data integration and management that enables you to manage all data within a single environment and accelerate digital transformation.” She listed common DFaaS characteristics:
- Connects to anything via pre-packaged connectors and components
- Manages data across all environments (multi-cloud and on-premises)
- Supports batch, real-time, and big data use cases
- Offers built-in machine learning, data quality, and governance capabilities
- Enables data integration and application integration scenarios
- Provides full API development lifecycle support
The best visual, said Norris, is “a fabric that stretches across location, data types and access methods.” In the past, each of those pieces required a different data persistence mechanism. A data fabric is a persistence layer that stretches across all of these.
Enterprises are getting pressure to democratize data access and open it up. At the same time, they must make sure it’s secure, protected and available. One role of the data fabric is to allow those two opposing forces to work together.
In the beginning, data fabrics were just about data on-premises and behind firewalls, said Lye. Now, with the explosion of cloud technologies, it has become exponentially more complicated.
“Data has gravity,” he said. “It doesn’t like to be moved. It’s easier to move the compute to the data rather than the data to the compute.” Luckily, available tools like Kubernetes make this easier. And with Istio, you create compute classes, so you can then move the compute to where the data is.
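The "data gravity" idea can be illustrated with a small placement sketch in Python. The site names, dataset sizes and the `pick_compute_site` helper are all hypothetical, and this is not how Kubernetes or Istio express placement. The sketch only shows the underlying reasoning: run the job wherever the least data has to move.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: dataset name -> (site where it lives, size in GB).
DATASETS: Dict[str, Tuple[str, int]] = {
    "clickstream": ("on-prem", 900),
    "orders": ("cloud-a", 40),
    "inventory": ("cloud-a", 15),
}


def gb_to_move(site: str, needed: List[str]) -> int:
    """Total GB that would have to be copied to `site` to run a job there."""
    total = 0
    for name in needed:
        location, size_gb = DATASETS[name]
        if location != site:
            total += size_gb
    return total


def pick_compute_site(sites: List[str], needed: List[str]) -> str:
    """Choose the site that minimizes data movement for the job."""
    return min(sites, key=lambda site: gb_to_move(site, needed))


if __name__ == "__main__":
    job_inputs = ["clickstream", "orders"]
    print(pick_compute_site(["on-prem", "cloud-a"], job_inputs))
```

In this made-up inventory, the large clickstream dataset pulls the job toward the on-premises site even though the other input lives in the cloud.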
Typically, data fabrics are used by customers wanting to improve business processes and gain efficiencies across big data analytics, predictive maintenance, risk and fraud. And, of course, compliance, said Nuage. “For example, using a data lake to create an inventory of all customer Personally Identifiable Information (PII) data to comply with General Data Protection Regulation (GDPR).”
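As a rough illustration of what such a PII inventory involves, here is a minimal Python sketch. It is not Talend's implementation. The `inventory_pii` function, the field names and the two regular expressions are assumptions made for the example, and the patterns are far too simple for real compliance work.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Deliberately simple, illustrative patterns (assumptions, not a standard).
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}


def inventory_pii(records: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of PII types seen in its values."""
    found = defaultdict(set)
    for record in records:
        for field_name, value in record.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.match(str(value)):
                    found[field_name].add(pii_type)
    return dict(found)


if __name__ == "__main__":
    customers = [
        {"id": "1001", "contact": "ana@example.com", "mobile": "+1 555 010 0199"},
        {"id": "1002", "contact": "li@example.org", "mobile": "555-010-0142"},
    ]
    print(inventory_pii(customers))
```

The result is a per-field summary of which kinds of personal data appear where, which is the starting point for deciding what must be protected, retained or deleted.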
Most Diverse Use Cases
A good example of the value of a data fabric, Norris said, could come from the real world: Blasting-as-a-Service (BaaS). Say someone wants a building demolished, but needs to leave the buildings on either side untouched. How much dynamite will that take? Instead of guessing, they hire the BaaS company. The company starts by asking a lot of questions. What size hole? What needs to be demolished? The company has been collecting a wide variety of data on its demolitions for several years, so it matches the requirements with that data and determines exactly what is needed. This is made possible in part by its MapR data fabric.
“It’s very precise,” said Norris. “This is the wave of the future — the ability of the company to marry analytics with the core value of their service.”
Along the same lines, but in a completely different industry, is Domino’s Pizza, the largest pizza company in the world, with a significant business in both delivery and carryout pizza. To keep a competitive advantage, the company now lets you order pizza from your TV, smartwatch, car entertainment system or social media platform. This created 17TB of data from 85,000 data sources, both structured and unstructured.
Using Talend’s data fabric, Domino’s gathers data from the company’s point-of-sale systems, 26 supply chain centers, and digital channels including text messages, Twitter, and Amazon Echo.
“We’ve become an e-commerce company that sells pizza,” says the company website.
Over in the pharmaceutical industry, weeds that damage crops used in making medication are an expensive problem. Bayer is applying AI technology to weed identification, thereby allowing farmers to apply the exact solution needed to kill each weed species.
So Bayer Digital Farming developed a new application called Weedscout, using Talend Real-time Big Data.
Farmers all over the world can download Weedscout for free. The app uses machine learning and artificial intelligence to identify weeds in photos that farmers upload to the app. This allows the grower to make a better choice regarding seed variety, the application rate of crop protection products, or harvest timing.
“This is part of the effort to increase yields, while still considering the environmental footprint of agriculture,” stated Talend’s website.
How to Make Your Own
Sound useful? Sure, there are DFaaS offerings out there, but what if you want to build your own? What exactly goes into a data fabric?
Norris, author of the white paper Twelve Rules for Data Fabrics, says that the first step is to create an overall data plan before addressing any technology. “DataOps,” he said, “is the key to understanding the requirements for data.”
It’s critical to address core issues, including how data access is governed and how consistency is managed across locations. For the core capabilities, you need to address how the data fabric stores data, whether it scales linearly and whether it can distribute metadata. The architecture that supports the scale and consistency of that data is a core building block.
How a fabric is secured is very important, he said. Often security is tied to access method, but data itself isn’t secure. In a fabric, granular security is needed at the data level regardless of how it’s being accessed.
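One way to picture data-level security is a record that carries its own access policy, so every access path gets the same answer. The Python sketch below is a toy model under that assumption. `SecuredRecord`, the role names and the two access functions are hypothetical and do not describe MapR's design.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SecuredRecord:
    """A record whose fields each carry their own set of permitted roles."""
    values: Dict[str, str]
    allowed_roles: Dict[str, Set[str]] = field(default_factory=dict)

    def read(self, role: str) -> Dict[str, str]:
        """Return only the fields this role may see; deny by default."""
        return {
            name: value
            for name, value in self.values.items()
            if role in self.allowed_roles.get(name, set())
        }


# Two hypothetical access paths. Neither makes its own security decision;
# both delegate to the policy stored with the data.
def read_via_rest(record: SecuredRecord, role: str) -> Dict[str, str]:
    return record.read(role)


def read_via_file_export(record: SecuredRecord, role: str) -> str:
    visible = record.read(role)
    return ",".join(f"{key}={value}" for key, value in visible.items())


if __name__ == "__main__":
    rec = SecuredRecord(
        values={"name": "Ana", "card_number": "4111-XXXX"},
        allowed_roles={"name": {"analyst", "billing"}, "card_number": {"billing"}},
    )
    print(read_via_rest(rec, "analyst"))
    print(read_via_file_export(rec, "billing"))
```

Because the policy lives with the record rather than in the REST handler or the export routine, adding a third access method cannot accidentally bypass it.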
Look at your system as a whole, Norris said. You need to address how it supports the access of data. Mixed data access across multiple protocols is common. Also, can it support multitenancy, data at rest and data in motion in the same fabric? How can you update the same fundamental data regardless of how you access it?
“Most of all,” he said, “you need to answer the question: is the data consistent?”
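A very simple way to start answering that question is to compare fingerprints of the same dataset held in different locations. The Python sketch below assumes small, JSON-serializable rows held in memory. The location names and both helper functions are hypothetical, and a production fabric would handle consistency very differently and at far larger scale.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(rows: List[dict]) -> str:
    """Order-independent SHA-256 digest of a list of JSON-serializable rows."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def inconsistent_locations(copies: Dict[str, List[dict]], reference: str) -> List[str]:
    """Names of locations whose copy differs from the reference copy."""
    expected = fingerprint(copies[reference])
    return [
        name
        for name, rows in copies.items()
        if name != reference and fingerprint(rows) != expected
    ]


if __name__ == "__main__":
    copies = {
        "on-prem": [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}],
        "cloud-a": [{"qty": 7, "id": 2}, {"id": 1, "qty": 5}],  # same data, new order
        "cloud-b": [{"id": 1, "qty": 5}, {"id": 2, "qty": 8}],  # one value drifted
    }
    print(inconsistent_locations(copies, reference="on-prem"))
```

Sorting keys and rows before hashing means that two copies holding the same data in a different order still count as consistent, so only real differences in content are flagged.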
The Data Fabric Stack
A data fabric has four layers, used together or individually, said Lye. The lowest layer is data storage. Here, a set of APIs exposes protocol management to customers through user interfaces.
Next is data services. APIs help manage several services at this level, including data protection, moving data (and what connections are needed to move with the data), securing the data (who’s accessing the data, should they be?) and inspecting and classifying the data (is it an order? A resume? Is it in the right place?).
The next layer is the control plane, which includes tools to surface the data to the people who will use it, typically the Site Reliability Engineer (SRE). It’s typically the SRE’s responsibility to optimize workload dimensions. Kubernetes services live here.
The top layer is analytics. Each of the companies has a view into the data fabric that monitors all of the underlying infrastructure, typically published through REST APIs.
What Skills an Engineer Needs
Intrigued? What skills should you be looking to acquire to become a data fabric expert? Actually, said Nuage, when your company buys DFaaS, not much.
“With data fabric that provides a visual drag-and-drop user interface, a data engineer with existing skills can get up to speed very quickly with new technologies to process and transform data at scale such as Spark or Serverless,” she said. Part of the point of a data fabric is to make the data easily accessible for the more casual ad hoc users or citizen data engineers.