The same story keeps surfacing at manufacturers, financial service companies, healthcare providers, and retailers alike: One group says their data lake project was wildly successful because they created an affordable central repository of nearly all enterprise data. But another group claims they have not yet exploited the new repository in their reports, dashboards, and applications. How could this be? We explain why this is happening and how to fix it.
What Did the Early Users of Data Lakes Expect?
The data lake was introduced as an answer to the problem of important data being locked up in silos of production applications and departmental databases. The assumption was that loading all that data into a single repository would reveal important insights about the operation of a company. The trend was amplified by the fact that Hadoop made it possible to store large amounts of data on affordable hardware.
Looking back at the first generation of data lakes, we now know there was a real disconnect between the producers of enterprise data lakes and the consumers of them. The producers saw a very cost-effective way to capture and centralize all structured and unstructured enterprise data. The “schema-on-read” approach meant that all data could just be dumped onto cheap storage and dealt with later. No wrangling was necessary to get the data clean or in the right type or format for consumption. This hard work was delayed until the data was consumed – hence the term “schema-on-read.” This seemed revolutionary as enterprises raced to capture their ever-growing volumes of data.
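To make the deferred work concrete, here is a minimal sketch of schema-on-read in plain Python. The sample records, field names, and the `read_with_schema` helper are all hypothetical and chosen only for illustration; the point is that the raw lines are stored untouched and every consumer has to parse, type-cast, and reject bad records at read time.

```python
import json

# Raw events land in the "lake" exactly as produced: no validation, no types.
# These sample lines are invented for illustration.
RAW_LINES = [
    '{"device": "pump-1", "temp": "71.5", "ts": "2020-01-01T00:00:00"}',
    '{"device": "pump-2", "temp": "n/a", "ts": "2020-01-01T00:00:05"}',
    'corrupted line',
    '{"device": "pump-1", "temp": "73.0"}',
]


def read_with_schema(lines):
    """Apply a schema at read time: parse, type-cast, and count bad records."""
    clean, rejected = [], 0
    for line in lines:
        try:
            rec = json.loads(line)
            clean.append({
                "device": str(rec["device"]),  # required field
                "temp": float(rec["temp"]),    # required, must be numeric
                "ts": rec.get("ts"),           # optional field
            })
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, non-numeric values, or missing required fields.
            rejected += 1
    return clean, rejected


clean, rejected = read_with_schema(RAW_LINES)
print(f"{len(clean)} usable records, {rejected} rejected")
```

Every report, dashboard, or pipeline that reads this data needs some version of this logic, which is the work that schema-on-read postpones rather than removes.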
But consumers of data had a completely different mindset back then. They expected this new big data stack to give them more than they had before: traditional data warehouse and analytical applications that performed better and scaled further. But this never materialized. BI tools such as Tableau, MicroStrategy, and Business Objects could not run on the data lake without more programming. Not only were their thousands of reports and dashboards no faster or more scalable; they did not run at all. Additionally, Extract, Transform, and Load (ETL) tools such as Informatica and Talend that structured data for consumption no longer ran without buying new modules, changing pipelines, and actually writing code. Further, all the new data that had never been captured before (device data, mobile data, and external data from outside the enterprise) could not be put to work without significant “schema-on-read” work including wrangling, cleansing, filtering, and transforming.
The Expectation Gap
The primary reason for the expectation gap is deeply rooted in the principle of schema-on-read. The first users of the big data stack were developers who were very nimble in taking data in any form, quality, and volume and writing code to manipulate it into the form they needed when they were ready to analyze it. They were AI and machine learning developers who were deeply trained in distributed computing and machine learning. Fundamentally, they were programmers. They built ad-tech systems that optimized media placements, fraud detection systems that protected consumers and enterprises, defense and intelligence systems that protected nations, and robotics systems for autonomous vehicles. They did this from scratch. These developers are highly trained computer scientists in high demand, and thus are expensive and scarce resources.
But enterprise consumers of data needed the functionality they had always relied upon for analytical applications. They are not coders, so to speak. They needed the same SQL relational databases and data warehouses that they had always used. These tried and true platforms were highly functional and powered virtually all of the enterprise’s operational systems and analytical reports and dashboards. Plus, there are droves of highly skilled IT professionals already in place who can design, develop, and operate these platforms. The expectation gap was this: the Hadoop-based data lake was supposed to deliver faster, more scalable databases and data warehouses, but what it actually delivered was a set of programming libraries with APIs for coding the raw manipulation of data at a much lower level of abstraction.
Requirements to Operationalize Data Lakes
A major misconception is that you can treat a data lake on Hadoop just like a database. In a data lake, the data is there, but the data has not been cleansed, indexed or operationalized. This becomes very apparent when a company’s data scales from just a few sources going into Hadoop to hundreds. Operational data lakes need the capabilities of the traditional SQL RDBMS and the traditional data warehouse combined.
- Ingestion: They need to be able to ingest many petabytes of historical data as well as stream in millions of data points from IoT and exogenous sources in real-time.
- SQL: They need to power traditional ETL pipelines, and BI reports and dashboards without significant reprogramming. It is no longer sufficient to say you integrate with BI if you have to have a yearlong project to make an old report useful again.
- Tables versus Files: Analysts and DBAs are conversant with tables and SQL. They are less comfortable in programming languages, with development environments, compilers, and file manipulation.
- Analysis: Operational databases need to perform the complex joining of data across many tables rife with aggregations and groupings to support enterprise analytical workloads.
- Concurrency: The real world does not have a single data scientist playing with a dataset in a sandbox, operating independently from anyone else. Real-time operational data lakes have many producers and many concurrent consumers of data. The database capabilities to power many users are required to make data lakes operational.
- Backup and Restore: Both system error and human error are inevitable. Servers go down, entire data centers are lost, and operators occasionally make mistakes. As a result, an operational data lake needs incremental backup and restore so that it can be returned to an earlier consistent state after unexpected events. The flat file systems of traditional data lakes, with files changed ad hoc, make this difficult. Database backup and restore capabilities and best practices enable operational runbooks.
- Updates: Wrangling data is hard and so is powering applications. Analysts need to change data to wrangle it into place without generating new sets of files to manage every time they need to massage a record or two. Additionally, application developers need to change records in place to power applications. Mistakes are made and data needs to be updated. One of the most underestimated needs for updates is the materialized aggregations of analysis. Maintaining operational data stores often requires the materialization of summaries of data by geography, by department, or by other dimensions. Data lakes need to be able to update these summaries frequently to be real-time. Databases and data warehouses enable the consistent change of data in the database with security features to lock down privileges, encrypt and compress data, and, most importantly, roll back data to previous consistent states if an exception is thrown in a process, pipeline, or application.
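Several of the requirements above (SQL, joins with aggregation, in-place updates, materialized summaries, and rollback) can be illustrated together in a few lines. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a transactional SQL engine; the table names, sample rows, and the `refresh_summary` and `correct_sale` helpers are hypothetical and not taken from any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales  (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
    CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL);
    INSERT INTO stores VALUES (1, 'east'), (2, 'west');
    INSERT INTO sales  VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0);
""")


def refresh_summary(conn):
    """Rebuild the materialized summary from a join plus an aggregation."""
    conn.execute("DELETE FROM sales_by_region")
    conn.execute("""
        INSERT INTO sales_by_region
        SELECT st.region, SUM(sa.amount)
        FROM sales sa JOIN stores st ON sa.store_id = st.store_id
        GROUP BY st.region
    """)


def correct_sale(conn, sale_id, new_amount):
    """Update one record in place and refresh the summary as a single unit."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE sales SET amount = ? WHERE sale_id = ?", (new_amount, sale_id)
        )
        if cur.rowcount != 1:
            raise LookupError(f"no sale with id {sale_id}")
        # Validation runs after the UPDATE on purpose, to show the rollback.
        if new_amount < 0:
            raise ValueError("amount must be non-negative")
        refresh_summary(conn)


def totals(conn):
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))


with conn:
    refresh_summary(conn)

correct_sale(conn, 2, 55.0)       # fix a single record; the summary follows

try:
    correct_sale(conn, 3, -1.0)   # fails validation; the UPDATE is rolled back
except ValueError:
    pass

print(totals(conn))
```

With files on a flat file system, the same correction would typically mean rewriting a file and regenerating the summary by hand, with no built-in way to undo a half-finished change.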
Business Benefits of Operationalized Data Lakes
Operational data lakes finally deliver on the promise of big data. They enable entirely new sources to be brought to bear upon both operational and analytical applications. Operational data lakes are scalable due to the distributed storage of the Hadoop stack, and deliver the promise of making computation faster by also leveraging the distributed compute power of the big data stack. But they do this with the steadfast capabilities of the RDBMS and data warehouse.
Operational data lakes are revolutionizing industries. Retailers, manufacturers, and 3PL/4PL logistics providers are extracting data from ERP systems, merchandising systems, warehousing systems, and transportation systems to provide real-time global supply chain data. They are enriching these data lakes with real-time exogenous carrier data and weather data to power new real-time supply chain planning applications and available-to-promise (ATP) systems. These systems are capable of both learning from experience and planning in real time.
Healthcare providers are extracting clinical electronic medical records (EMRs), patient data, and operational data to power new predictive applications that assist clinicians in caring for patients and help optimize hospital operations.
Financial service institutions are extracting client and advisor data into operational data lakes that help answer real-time questions such as: Who are my most profitable clients? Which ones are likely to become highly valuable? Who are my most effective advisors?
Imagine what marketers can finally do if every click and mobile access across selling and marketing channels is captured in real time and accessible to operational applications.
And finally, companies with extensive networks of engineered equipment such as telcos, networking companies, utilities, and oil and gas companies need to avoid service outages. Their operational data lakes store real-time data from every component in the network and power predictive applications that anticipate the next failure event so that it can be proactively avoided with predictive maintenance.
How We Operationalize Data Lakes
Splice Machine is a unique data platform that is purpose-built to operationalize data lakes. It is the only seamlessly integrated SQL RDBMS and data warehouse built on the big data stack. We deliver all of the requirements above into one platform that uses multiple engines “under-the-hood” to deliver this diverse capability. We call this On-Line Predictive Processing (OLPP) because it combines OLTP (for ingestion, updates, concurrency, and backup) with OLAP (for analytics and machine learning).
Splice Machine stores data on Hadoop and cheap block storage. It accesses data with the needle-in-the-haystack lookup and ingestion speed of Apache HBase and the in-memory analytical computation of Apache Spark. Users of Splice Machine do not have to deal with the low-level distributed system programming usually necessary to deploy these engines. Instead, they just issue SQL to the engine, and it determines how best to process a query and which compute engine to use.
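The idea of choosing a compute engine per query can be pictured with a toy router. This is not Splice Machine's actual optimizer, whose internals are not described here; the `QueryPlan` class, the `choose_engine` function, the row estimates, and the threshold are all hypothetical and exist only to show the general shape of a cost-based decision.

```python
from dataclasses import dataclass

# Hypothetical cutoff: plans expected to touch more rows than this are sent
# to the analytical path. The value is arbitrary and for illustration only.
ROW_THRESHOLD = 10_000


@dataclass(frozen=True)
class QueryPlan:
    sql: str
    estimated_rows: int  # assumed to come from table statistics


def choose_engine(plan: QueryPlan) -> str:
    """Send small lookups to a key-value path and large scans to an
    in-memory analytical path, based only on the estimated row count."""
    return "lookup" if plan.estimated_rows <= ROW_THRESHOLD else "analytical"


point_read = QueryPlan(
    "SELECT * FROM orders WHERE order_id = 42", estimated_rows=1
)
rollup = QueryPlan(
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    estimated_rows=5_000_000,
)

for plan in (point_read, rollup):
    print(choose_engine(plan), "<-", plan.sql)
```

A real optimizer would weigh far more than a single row estimate, but the user-facing contract is the same: the caller writes SQL and never names an engine.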
How Different Constituents Interact with Operational Data Lakes
Analysts can access a Splice Machine-powered operational data lake from standard BI tools through a standard JDBC/ODBC adapter. Data engineers can use their standard wrangling and ETL tools like Informatica and Talend. Application developers can write real-time programs in their language of choice, including Java, Python, Scala, Ruby, Node.js, and C++. Data scientists can use our Apache Zeppelin notebooks to experiment and collaborate with direct in-process access to Spark DataFrames in Scala and Python. They can also write functions in R and SAS directly against the SQL data store.
Whatever a user’s access path to the data platform usually is, it is vital to support it.
How Companies Deploy an Operational Data Lake
Some companies have stringent privacy and security requirements and deploy their data lake on on-premises Hadoop clusters. Others deploy it through a Database-as-a-Service on public clouds. No matter the platform, operational data lakes give companies in diverse industries, such as financial services, oil and gas, healthcare, and manufacturing, the needle-in-the-haystack speed of Apache HBase as well as the in-memory, distributed, analytical firepower of Apache Spark. Operational data lakes deliver the promise of the data lake and provide the real-time performance users need.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Real.