Getting Cloud Data Lakes Right
Most business users track top line, bottom line and customer experience metrics using BI tools, which operate on small sets of relational data (a few terabytes) in a data warehouse, and require small data scans (a few gigabytes) to execute. Data analysts, the primary users of business intelligence (BI) tools, collaborate with data engineers to write the SQL scripts that load the data they need.
But nowadays, gaining a competitive advantage from data goes beyond BI to applications ranging from interactive, streaming and clickstream analytics, to machine learning, deep learning and more. For these newer applications, data lakes provide the optimal architecture. Data arrives at different velocities, in different forms and from multiple sources, and gets staged and cataloged into a central repository. It is then made available for any type of analytics or machine learning application, at any scale, in a cost-efficient manner.
That’s the theory, at least, but we all know building a successful data lake isn’t as easy as it sounds. After a period of hype a few years ago, the inevitable backlash set in as some businesses realized they had built terabyte-scale data lakes that didn’t deliver on the value promised. With the continued onslaught of data, this isn’t a reason to give up on data lakes, however; it’s a reason to take a step back, examine what the industry has learned and build using best practices in the future.
There are three broad areas that require more attention from data teams to build effective data lakes: data ingestion, data layout, and data governance. This article takes each area in turn and describes key considerations and capabilities to ensure a successful project. We’ve focused here on cloud data lakes, because that’s where most new data lakes are now being created, for reasons of cost and scale.
Data Ingestion: Solve for Both Batch and Streaming
Data must be ingested into the lake before it becomes useful for analysts and data scientists. Hence, data ingestion lays the foundation for the effectiveness of downstream analytics. Architectural best practices for data ingestion include:
Managing Batch and Streaming Data Ingestion
Data flows into a data lake with variable velocity and in various types. Whether ingesting batch or streaming data, the data lake must ensure zero data loss and write exactly-once or at-least-once; handle schema variability; write in the most optimized data format into the right partitions, and provide the ability to re-ingest data when needed:
- Batch Data Ingestion: Batch systems predominantly move application data, logs and relational datasets from transactional systems to the data lake on a periodic basis. For batch ingestion of transactional data, the data lake must support row-level inserts and updates — UPSERT — to datasets in the lake. Upsert capability with snapshot isolation — and more generally, ACID semantics — greatly simplifies the task, as opposed to rewriting data partitions or entire datasets. Further, ACID semantics means concurrent writes and reads are possible on a data lake without data integrity issues or a perceptible reduction in read performance.
- Streaming Data Ingestion: IoT data and clickstreams require real-time analytics where data is collected at different velocities and in different formats. Even for streaming data, the data lake must guarantee that data is written exactly once or at least once. This is possible by using Spark Structured Streaming in conjunction with streaming data arriving at variable velocity from message queues such as Kafka that maintain offsets. (This can also be achieved using Amazon Kinesis.) A data lake solution for stream processing should integrate with the schema registry in message queues, where engineering can manage schema to prevent the breakage of data pipelines whenever an event’s schema changes. Also, the data platform must support replay capability to keep up with business evolution on stream processing and re-process/reinstate outdated events.
The data lake must support continuous optimizations that create aggregates and gold datasets from ingested data for downstream analytics. We will talk about this aspect in the next section on data layout.
Source-to-Target Schema Conversion
To migrate data from a database or data warehouse to a data lake, engineers must typically recreate the schema on logical entities all over again. This is mundane and time-consuming, and it requires keeping schema in sync with the source schema. Data in S3, such as log files, may not have a schema and may land in data lake as sparsely populated, deeply nested semi-structured datasets. Data engineers will benefit from the ability to intelligently detect source schema and create logical tables on the fly, and flatten semi-structured JSON, XML or CSV into columnar file formats. The schema of these tables should then be kept in sync to support continuous integration.
Monitoring Movement of Data
Data ingestion pipelines sometimes fail for reasons that are difficult to control, such as erroneous data, schema drift, intermittent platform glitches, and more. That’s why it’s important to connect pipelines and the underlying infrastructure to rich monitoring and alerting tools such as Datadog, Prometheus and SignalFx, to shorten time to recovery after a failure. As a best practice, erroneous records should be sent to an error stream for root cause analysis.
Keeping Data Fresh
Businesses demand data availability to reduce time to insight. Data restatement and row-level data inserts using UPSERT capability — INSERT, UPDATE, MERGE — is an important consideration in data lake architecture to keep data fresh. It is cumbersome and expensive to rewrite partitions or entire datasets in order to update a few rows.
Data Layout: Optimize for Machine Learning and SQL Analytics
Data generation, and by extension data collection, is bursty and continuous. Machine data is often sparse and collected in semi-structured formats such as JSON or CSV. Recorded media such as video, images and free text are in an unstructured format. Attempting to inspect, explore and analyze these datasets in their raw form is arduous and slow, because the analytical engines scan the entire data set across multiple files. A better approach is to plan ahead and build ETL pipelines that reflect a well-defined layout strategy for frequently accessed data. The key is to reduce data scanned and query overheads using different techniques.
Use Columnar Data Formats for Read Analytics
Publish data in open source columnar formats such as ORC and Parquet to reduce data scans. These machine-readable, binary file formats are optimized for read analytics. Proactively avoid queries that need to parse JSON with functions such as json_parse and json_extract as they scan the entire dataset. Flatten frequently accessed JSON values by cleaning them, casting to a data type and storing in a columnar format.
Partition data by frequently used predicates (SQL WHERE clause) such as time, geography and line of business to reduce unnecessary data scans. Since partitions add metadata to the metastore and result in lookup overheads in query execution, it is important to tune the partition granularity, sized based on the dataset under consideration. A typical requirement is to partition by year, month, week, day or hour — but not by minute or second. Ingestion systems will take care of parsing event timestamps and writing to the correct physical location of the partition and updating the metastore.
Use Compaction to Chunk Up Small Files
The bursty arrival of data, along with real-time stream ingestion, results in data written into multiple files of different sizes in the cloud object store. Network calls in the cloud object store are expensive and slow compared to an on-prem model, which creates a drag on read performance. Asynchronously chunking up small files into bigger files through compaction reduces these network overheads and speeds performance. Compaction is necessary and a better strategy than attempting to tune ingestion systems based on data arrival patterns, which are unpredictable or at least hard to predict.
Collect Statistics for Cost-Based Optimization
Collect and maintain statistics of the dataset such as file size, rows and a histogram of values. The cost-based optimizer in the analytical engine’s runtime can significantly improve performance by using the available statistics to optimize queries through techniques such as join reordering.
Z-Order Indexed Materialized View for Cost-Based Optimization
A materialized view (MV) is an ordered copy of data based on a particular sort key. Analytical engine runtimes can use materialized views to selectively scan data for query plans with filter conditions. Indexes should be maintained not just at the table-level but also the partition-level. While more than one materialized view can be configured for a given table, that requires additional storage and computing to keep it updated. Z-order indexing on a single materialized view helps solve this problem. A z-order index serves queries with multiple columns in any combination and not just data sorted on a single column.
The concept of a managed data lake is to deliver autonomous data management capabilities that can be used to operationalize the aforementioned data layout strategy. Data engineering is faster and easier with a managed data lake.
Data Governance: Manage Outcomes
When data ingestion and data layout are implemented well, data can be made widely available to users in a democratized fashion. When multiple teams start accessing data, data architects need to exercise oversight and manage outcomes. Enterprise-grade data platforms that serve important customers and deliver meaningful experiences need to blend the best of innovation with oversight, regulatory compliance and role-based access controls. In this context, data architects will find a single pane of glass essential for configuration management, auditing, obtaining job reports and exercising cost control.
Discover Your Data
Data itself is hard to find and comprehend and not always trustworthy. Users need the ability to discover and profile datasets for integrity before they can trust them for their use case. A data catalog enriches metadata through different mechanisms, uses it to document datasets, and supports a search interface to aid discovery.
- Use crawlers and classifiers to catalog data. More and more datasets are streaming in from sensors and customer interactions, and data users spend a lot of time searching for these semi-structured datasets. Automatically adding descriptions about the context of how the data came in, and keeping the metadata and data in sync, will speed up the end-to-end cycle from discovery to consumption.
- Data dictionary and lineage. Annotating data where it resides is a powerful way to qualify datasets. Besides schema and the physical location of data, data dictionaries contain table and column descriptions, the most frequent users and usage statistics, canonical queries that include the said table, and so on. Moreover, data lineage allows users to trust data for business use by mapping the data lifecycle from its origin, including how it was modified along the way and who touched it.
- Metadata management. Answering questions that come up on a need to know basis, like a customer churn analysis to understand how users buy and what biases they have, typically requires wrangling new and disparate datasets. Since the first step is to discover the required datasets, it’s essential to surface a data dictionary to end-users for exploration purposes, to see where the data resides and what it contains, and to determine if it is useful for answering a particular question. Discovery includes data profiling capabilities that support interactive previews of datasets to shine a light on formatting, standardization, labels, data shape, and so on.
Regulatory and Compliance Needs
New or expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around Right to Erasure and Right to Be Forgotten. These govern consumers’ rights about their data and involve stiff financial penalties for non-compliance (as much as 4% of global turnover), so they must not be overlooked. Therefore, the ability to delete specific subsets of data without disrupting a data management process is essential. In addition to the throughput of DELETE itself, you need support for special handling of PCI/PII data, and auditability.
Permissioning and Financial Governance
Cloud data lakes facilitate instant access to data and avoid long procurement cycles. As an instructive example, a deep integration with the Apache Ranger open source framework facilitates table, row and column level granular access. Architects can grant permissions against already-defined user roles in the identity and access management (IAM) access solutions from cloud service providers. Since this approach builds atop defined user roles and policies, it simplifies permissions management for infrastructure, engines and tools, and data.
With wide-ranging usage, monitoring and audit capabilities are essential to detect access violations, flag adversarial queries, and more. While the cloud offers agility, it can come at a price if you take your eyes off of cost meters or don’t forecast computing needs. To give P&L owners and architects a bird’s eye view of usage, they need cost attribution and exploration capabilities at the cluster-, job- and user-level from a single pane of glass.
Getting cloud data lakes right is essential to building a sustainable advantage using the data you collect. Done right, a cloud data lake will break down data silos and facilitate multiple analytics workloads at any scale and at very low cost. Democratizing data and delivering new, improved customer experiences suddenly become achievable.
Feature image via Pixabay.