Apache Iceberg: A Different Table Design for Big Data
Like so many tech projects, Apache Iceberg grew out of frustration.
Ryan Blue experienced it while working on data formats at Cloudera.
“We kept seeing problems that were not really at the file level that people were trying to solve at the file level, or, you know, basically trying to work around limitations,” he said.
Those problems included the inability to reliably write to Hive tables, correctness issues and not being able to trust the results from its massively parallel processing database.
When he moved to Netflix, “the problems were still there, only 10 times worse,” he said.
“At Netflix, I spent a couple of years working around those problems and trying to basically patch them or deal with the underlying format. … I describe it as putting Band-Aids over these problems, very different problems here, there. We had a lot of different ones. And we finally just said, ‘You know, we know what the problem is here. It’s that we’re tracking data in our tables the wrong way. We need to fix that and go back to a design that would definitely work.’”
The outgrowth of that frustration is Iceberg, an open table format for huge analytic datasets.
It’s based on an all-or-nothing approach: An operation should complete entirely and commit at one point in time or it should fail and make no changes to the table. Anything in between leaves a lot of clean-up work.
With Hive, he explained, the idea was to keep data in directories and be able to prune out the directories you don’t need. That allows Hive tables to have fast queries on really large amounts of data.
The problem, though, is that what they were doing was trying to keep track of these directories. And that didn’t scale in the end. So they ended up adding a database of those directories. And then you would go find out what files are in those directories when you needed to query the data. That created a problem in which the state of a table is stored in two places in the database that holds the directories and in the file system itself.
“The problem with holding that state in the file system is that you can’t make fine-grained changes to it. You can make fine-grained changes to the set of directories. But you can’t make fine-grained changes to the set of files, which meant that if you wanted to commit new data to two directories at the same time, you can’t do that in a single operation that either succeeds or fails. So that’s the atomicity that we that we want from our tables,” said Blue, project management chair.
Netflix open-sourced the project in 2018 and donated it to the Apache Software Foundation. It emerged from the Incubator as a top-level project last May. Its contributors include AirBnB, Amazon Web Services, Alibaba, Expedia, Dremio and others.
The project consists of a core Java library that tracks table snapshots and metadata. It’s designed to improve on the table layout of Hive, Trino, and Spark as well integrating with new engines such as Flink.
One of its selling points is that users don’t have to know that much about partitioning.
“In the old model, the columns that were used to produce those directories, those were just normal columns, and they had no association to other columns, Blue said. “So if you wanted to store data by day, you would probably derive that date from a timestamp. But the system had no way of saying, ‘Oh, I know that you’re looking for this timestamp range.’ You had to add both the timestamps you’re looking for and the days that you’re looking for, which was just very, very error-prone. So we started keeping track of those relationships so that we can take queries on timestamp and bake those down into queries on the date ranges and automatically figure out what files you need.”
Iceberg users don’t have to maintain partition columns or even understand the physical table layout to get accurate query results, an IBM blog post explains. Iceberg handles all the details of partitioning and querying, and keeps track of the relationship between a column value and its partition without requiring additional columns.
In addition to addressing the reliability and correctness issues, the project focused on improving performance by using file metadata to can skip more files to satisfy queries faster, and in-place table evolution, so it can change as business needs change.
“Now we’re working on all of the new things that we can do, given this fundamentally better design for tables,” Blue said. They include adding row-level deletes and upserts. It just committed “merge into” as a high-level SQL operation and “delete from” and will add “update.”
“So those operations are a lot less targeted at knowing how your table is stored and laid out, and much more focused on what do you want to do to individual rows in your table? And that’s where we want our data engineers to be focused,” he said. The system can make things fast and efficient because it can figure out exactly which data files need to be updated, and then go rewrite those data files.
Blue said he’s excited about the capabilities with row-level deletes and the ability to able to build data services that can operate on tables that don’t require users to think about the details or physical layout quite so much.
Decoupling Compute and Data
“We want the data tier to support things like transactions and data mutations and time travel. And it needs to be open source and accessible to all these different engines, that the whole value of a modern, loosely coupled architecture. So Iceberg is a perfect fit for that,” he said.
Dremio, which aims to eliminate the middle layers and the work involved between the user and the data stores, has announced plans to integrate its platform with Iceberg this year. It has two projects related to Iceberg:
- Project Nessie provides a git-like semantics for data lakes. It enables users to experiment with branches of data or prepare data without affecting the live view of the data.
- Arrow Flight 3.0 provides the ability for Apache Arrow-enabled systems to exchange data between them simultaneously at speeds that are orders of magnitude faster than possible before.
He sees two competing standards in the space, Delta Lake, created by Databricks, and Iceberg.
One of the problems with Delta Lake, he said, is that you can only do inserts and transactions from Spark, while Iceberg allows transactions and updates in time travel from any system — from Dremio, Spark, Presto, etc.
“It comes to how data is stored,” Shiran said. “People are always going to choose the more open approach. We saw that with Parquet, when it came to file formats, right? There were competing standards at the time and some only worked with one engine like Hive and others work across the board. Parquet obviously won, and I think it’s kind of very similar situation.”