How to Build a Modern Data Infrastructure Using a Lakehouse

There is growing interest in building modern data architectures on lakehouses. Here is what you need to know to make it happen — from the folks at Persistent Systems.
Sep 12th, 2022 2:00am by
Feature image via Pixabay.

Data continues to be one of the most critical assets and transformational drivers across small and large enterprises. However, enterprises are grappling with constantly growing data volumes, which makes their existing data lake and data warehouse architectures difficult to operate and manage.

IT and businesses want a better approach to their data platform strategy so that architects and engineers can spend less time worrying about the plumbing — integrating different components, making them talk to each other — and more time building data solutions.

If your organization is serious about modernizing your data management, you’ll need an architecture that can help you achieve that goal. We’ve all heard the recent buzz about lakehouses, but how did we get here?

Data Warehouse vs. Data Lake

Purushottam Darshankar
Purushottam Darshankar is chief data architect at Persistent Systems, based in London, U.K. He has over 25 years of experience in IT and is responsible for the design, architecture and delivery of big data solutions in banking, financial services, health care, retail and utility industry domains. Prior to his current role at Persistent Systems, Purushottam served leading multinational companies like Wipro, Siemens and Reliance Communication. He earned his master’s degree in electrical engineering from the Govt. College of Engineering, Pune, and in management studies from IIM Kolkata.

Prior to the recent advances in data management technologies, data warehouses were the architectural standard for storing and processing business data. While data warehouses were extremely reliable for structured data, the platform architecture began to falter with the introduction of unstructured data.

Also, the data warehouse model required a lot of data preparation and data movement in the form of ETL (extract, transform and load) before you could run data analysis, so time-to-insight was often long.

As a solution, enterprises began to build data lakes, low-cost repositories that stored any type of raw data, structured or unstructured, in an open file format, so that it could be transformed later to handle different business use cases.

Unlike data warehouses, which enforce a table schema, data lakes do not enforce a schema, which allows unfamiliar data sources to be stored, including video, text files, images and audio. While the data lake was a promising strategy, many lakes quickly turned into “data swamps” of disorganized data because of improper data governance, resulting in stale, unused data within the lake.

So how can organizations get the best of both worlds, combining the benefits of data warehouses and data lakes?

A Combined Approach — Data Lakehouse

Since data warehouses and data lakes both have benefits, enterprises came up with a new data management architectural pattern: a lakehouse that combines the low cost, scalability and flexibility of a data lake with the data management and data structure of a data warehouse.

Separating storage and compute allows for increased availability and scalability at a lower cost. Compute runs on separate clusters that can scale independently of storage and of each other, depending on the type of workload.

Lakehouses support ACID (atomicity, consistency, isolation and durability) transactions, which ensure data reliability and data integrity in cases of failure or when different components are performing concurrent operations.
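To make the atomicity part of ACID concrete, here is a minimal sketch in Python. It uses the standard library's SQLite module as a stand-in for a lakehouse table format, which is an assumption made purely for illustration; the table name `orders` and the helper `write_batch` are hypothetical. The point is only that a batch either lands completely or not at all.

```python
import sqlite3


def write_batch(conn, rows):
    """Insert every row in the batch, or none of them (atomicity)."""
    try:
        # The connection context manager commits on success and
        # rolls back if any statement inside the block raises.
        with conn:
            for order_id, amount in rows:
                conn.execute(
                    "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
                    (order_id, amount),
                )
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical table; SQLite stands in for a lakehouse table here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

ok = write_batch(conn, [(1, 10.0), (2, 20.0)])
# The second row reuses order_id 1, so the whole batch is rolled back,
# including the otherwise valid row with order_id 3.
failed = write_batch(conn, [(3, 30.0), (1, 99.0)])
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After both calls, only the two rows from the first batch remain, so readers never observe a half-written batch.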

They also enforce schema standards and apply governance, ensuring that data within a data lakehouse is properly organized, governed and consistent.
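Schema enforcement on write can be sketched in a few lines of plain Python. This is an illustrative assumption about how such a check might look, not the mechanism of any particular lakehouse product; `EXPECTED_SCHEMA`, `validate` and `append_records` are hypothetical names.

```python
# Hypothetical schema for a customer table: column name -> Python type.
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "balance": float}


def validate(record, schema=EXPECTED_SCHEMA):
    """Return a list of schema violations for one record (empty if valid)."""
    errors = []
    for column, expected_type in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            actual = type(record[column]).__name__
            errors.append(
                f"{column}: expected {expected_type.__name__}, got {actual}"
            )
    for column in record:
        if column not in schema:
            errors.append(f"unexpected column: {column}")
    return errors


def append_records(table, records):
    """Append only if every record passes; otherwise reject the whole write."""
    problems = {}
    for index, record in enumerate(records):
        errors = validate(record)
        if errors:
            problems[index] = errors
    if problems:
        raise ValueError(f"schema violations: {problems}")
    table.extend(records)
```

Rejecting the whole write rather than silently dropping bad rows is a design choice in this sketch; it keeps the table consistent and pushes the fix back to the producer of the data.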

And lakehouses allow BI (business intelligence) tools to directly access data, which improves the freshness of the data used for real-time reporting.

A robust, modern enterprise data architecture uses the lakehouse approach to integrate a data lake and a data warehouse, along with other purpose-built functionality for unified data governance and seamless data movement.

Over the past few years, while the core data-processing systems have more or less remained the same, the supporting tools and platforms have proliferated rapidly. The right combination of individual tools and technologies should give you the ability to build the right modern data platform for your business.

Modern Cloud Reference Architecture

A modern enterprise data architecture enabled by a lakehouse provides accessibility, speed, flexibility and reliability so that enterprises can optimize every data source and use it for better business decisions. Now that you have a data lakehouse, you still need a host of supporting services.

Think about an actual lake house (such as an Airbnb vacation getaway). Lake houses offer visitors a great view but still need supporting services like trash removal, cleaning and upkeep, landscaping, security services, etc. For data lakehouses, the current platform ecosystem comes with a range of characteristics, and the important ones are listed below.

Automated Workflow and Orchestration

With automated workflow and orchestration on the cloud, enterprises can ensure that the data flows smoothly and freely to all parts of the organization, while maintaining data quality and governance.

Vendors such as Airflow/Astronomer, Prefect, and Dagster provide tools to orchestrate analytical and operational workflows.
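At its core, an orchestrator runs tasks in dependency order. The sketch below shows that idea using only Python's standard library; it does not use the Airflow, Prefect or Dagster APIs, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies):
    """Run each task after all of its upstream tasks have finished.

    tasks: maps a task name to a function that receives the results so far.
    dependencies: maps a task name to the set of tasks it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical three-step workflow.
tasks = {
    "extract": lambda results: [1, 2, 3],
    "transform": lambda results: [x * 10 for x in results["extract"]],
    "load": lambda results: sum(results["transform"]),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

order, results = run_pipeline(tasks, dependencies)
```

Real orchestrators add scheduling, retries, parallelism and monitoring on top of this ordering step.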

Data Pipelines and ELT Processing

This is core to the cloud data platform: it ensures that data arrives at its destination accurately, on time and in the right format.

This space has evolved from traditional ETL (extract, transform and load) vendors to a new class of cloud-native players (e.g., Fivetran, dbt and Matillion) that are capable of handling more complex dependencies across different data environments.
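The ELT pattern named in the heading loads raw data first and transforms it afterward inside the target engine. Here is a minimal sketch of that ordering, with SQLite standing in for the cloud data platform; the table names and sample rows are made up for illustration.

```python
import sqlite3

# Made-up raw records as they might arrive from a source system.
raw_events = [
    ("2022-09-01", "eu", "12.50"),
    ("2022-09-01", "us", "7.25"),
    ("2022-09-02", "eu", "3.00"),
]

conn = sqlite3.connect(":memory:")

# Load: land the raw rows unchanged (amounts are still text).
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_events)

# Transform: cast and aggregate with SQL inside the engine, after loading.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_sales
    GROUP BY region
    """
)

totals = dict(conn.execute("SELECT region, total_amount FROM sales_by_region"))
```

Because the raw table is kept, the transformation can be rewritten and rerun later without going back to the source system.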

AI/ML (Artificial Intelligence and Machine Learning)

This includes advanced analytics that applies ML and algorithmic modeling to optimize business decisions. This space is flourishing, which is evident from the rising number of vendors supporting it.

Libraries such as PyTorch, TensorFlow, and Rasa provide AI/ML algorithms for data scientists to train models, and notebooks such as Jupyter and Zeppelin help them customize AI/ML algorithms.

Data Governance and Security

As the data stack becomes increasingly complex, data governance and security have become more critical to securing data and maintaining compliance throughout the data lifecycle. Under stringent data security guidelines, data access must go through a controlled access and authorization mechanism.

Laws and regulations such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act) have been enacted by governments to protect PII (personally identifiable information).

Vendors like OneTrust, Collibra, Privacera, and Immuta are helping enterprises address the parts of their security needs that fall under regulatory oversight.
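One common building block of controlled data access is masking PII columns for roles that are not authorized to see them. The sketch below illustrates the idea in plain Python; the column names, role names and helpers are hypothetical and do not reflect how any of the vendors above implement it.

```python
import hashlib

# Hypothetical policy: which columns hold PII, and which roles may read it.
PII_COLUMNS = {"email", "phone"}
ROLES_WITH_PII_ACCESS = {"compliance_officer"}


def mask(value):
    """Replace a value with a short, stable token.

    An unsalted hash is used only to keep the sketch short; it is not
    strong anonymization on its own.
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return f"masked:{digest[:8]}"


def read_row(row, role):
    """Return the row as the given role is allowed to see it."""
    if role in ROLES_WITH_PII_ACCESS:
        return dict(row)
    return {
        column: (mask(value) if column in PII_COLUMNS else value)
        for column, value in row.items()
    }


row = {"customer_id": 7, "email": "a@example.com", "country": "UK"}
analyst_view = read_row(row, "analyst")
compliance_view = read_row(row, "compliance_officer")
```

Applying the policy at read time keeps one copy of the data while giving each role a different view of it.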

Data Observability

Data observability is a recent addition to the list of data platform capabilities and refers to providing monitoring and diagnostic capability across the data flow.

Tools such as Monte Carlo, Great Expectations, and Bigeye provide automated monitoring, alerting and triaging to identify and evaluate data quality and discoverability issues.
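Two of the simplest observability checks are freshness (how old is the newest row?) and null rate (how much of a column is missing?). The following sketch shows both in plain Python; the thresholds, the `loaded_at` field and the `check_table` helper are assumptions for illustration and not the API of any tool named above.

```python
from datetime import datetime, timedelta


def check_table(rows, now, column, max_age=timedelta(hours=24), max_null_rate=0.1):
    """Return a list of human-readable alerts for one table snapshot."""
    if not rows:
        return ["volume: table is empty"]
    alerts = []
    # Freshness: compare the newest load timestamp against the allowed age.
    newest = max(row["loaded_at"] for row in rows)
    if now - newest > max_age:
        alerts.append(f"freshness: newest row is {now - newest} old")
    # Quality: share of rows where the monitored column is missing.
    null_rate = sum(row[column] is None for row in rows) / len(rows)
    if null_rate > max_null_rate:
        alerts.append(f"quality: {null_rate:.0%} of '{column}' values are null")
    return alerts


# Made-up snapshot: stale data with one missing amount out of two rows.
now = datetime(2022, 9, 12, 2, 0)
rows = [
    {"loaded_at": now - timedelta(days=3), "amount": 10.0},
    {"loaded_at": now - timedelta(days=2), "amount": None},
]
alerts = check_table(rows, now, column="amount")
```

In practice such checks would run on a schedule and feed an alerting channel rather than return a list.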

In summary, there is growing interest and clarity about modern data architecture built on lakehouse fundamentals, and it is being supported by a wide range of vendors (including Amazon Web Services, Google, Azure, Starburst, Databricks, etc.) and data warehouse players.

Cloud data warehouses like Snowflake have grown rapidly, focused largely on SQL users and BI use cases. But enterprises have also seen accelerated adoption of other technologies, such as the data lakehouse provided by Databricks.

With a data landscape modernization offering, Persistent Systems helps enterprises modernize their data landscape on the cloud.

Of course, choosing the lakehouse approach is only a fraction of the work. A good data model with optimized data-processing flow is a must for performance and cost optimization of the data platform.

TNS owner Insight Partners is an investor in: Astronomer, OneTrust, Privacera.