Apache Iceberg 101: The Table Format Reshaping Data Lakes

Sean Lynch
4 September 2024

If you’ve been following the data space over the last few months, you can’t help but hear about Iceberg. Originally created by Netflix and then handed off to the Apache Software Foundation, it’s an open source table format. It was built to solve a high-scale technical problem: supporting reliable data access across multiple concurrent readers and writers, even as data and metadata change. It’s also a philosophy: separating the storage of data from compute. Netflix built this to help them scale, but this very simple idea has the potential to upend how companies architect and activate their data.

Today, Iceberg is complex and rapidly evolving, so before we get into the details, let’s start with a cheat sheet.

  • You know CSV. CSV is a file format. So is Parquet, a more advanced file format for storing large datasets (it uses a columnar structure to make storage and querying more efficient).
  • Iceberg, on the other hand, is a table format, meaning it represents a database table as a cluster of different files for data (frequently written in Parquet) and metadata. Delta Lake is a competing open table format that currently underpins much of Databricks.
  • Iceberg needs an Iceberg Catalog. This is a small service that’s responsible for keeping the table’s metadata up to date. It is different from a data catalog like Atlan or Select Star; they do different things. (A sketch of how these pieces fit together follows this list.)
  • A Data Lake is the concept of using cloud object storage like S3 or GCS and storing your data in open file/table formats, holding high volumes as cheaply as possible. Writing the same data directly into a data warehouse would be expensive in terms of both compute and storage. And every modern data warehouse has a way to access files in the data lake.
  • A Data Lakehouse is the combination of a Data Lake with warehouse-style capabilities: running SQL queries, running batch jobs, and setting up data governance schemes.
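
To make the cheat sheet concrete, here’s a minimal PyIceberg sketch of reading an Iceberg table. The catalog endpoint and table name are hypothetical; it assumes a REST catalog is already running and the table already exists.

from pyiceberg.catalog import load_catalog

# The Iceberg Catalog is the small service from the cheat sheet: it maps
# table names to the table's current metadata in object storage.
catalog = load_catalog(
    "lake",
    type="rest",
    uri="https://iceberg-catalog.example.com",  # hypothetical endpoint
)

# "analytics.events" is one logical table backed by many Parquet data
# files plus Iceberg metadata files sitting in S3/GCS.
table = catalog.load_table("analytics.events")

# Readers use the metadata to plan scans; PyIceberg returns Arrow here.
print(table.scan().to_arrow())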

Why Iceberg Matters

Iceberg was originally created for scale (size, distribution, complexity, and concurrency) on top of cloud object storage like S3 and Google Cloud Storage. It delivers on that promise and has led to some significant benefits.

Iceberg is built on open formats like Parquet, so you can avoid some amount of lock-in (we’ll talk more about that in a minute). But it doesn’t require sacrificing querying capabilities to get there. It includes some very powerful capabilities such as time travel, which lets you query an existing table as of a specific timestamp or version (e.g., in Spark SQL):

SELECT * FROM my_table TIMESTAMP AS OF '2023-03-15 10:00:00'

or

SELECT * FROM my_table VERSION AS OF 10963874102873
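
The same time travel is available outside of SQL engines. A PyIceberg sketch, reusing the hypothetical catalog handle from the earlier example:

# Every write to an Iceberg table produces an immutable snapshot.
table = catalog.load_table("analytics.my_table")

# List the table's snapshot history...
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

# ...then scan the table exactly as it looked at a past snapshot.
old_rows = table.scan(snapshot_id=10963874102873).to_arrow()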

But one of Iceberg's key features is the complete separation of storage and compute. When Snowflake introduced the idea of scaling storage and compute independently, it was revolutionary. Iceberg takes the concept to its logical conclusion: your compute can be entirely separate from your storage.

This approach lets you choose storage optimized for your use case, your cloud, and your cost tradeoffs, and then connect it to whatever querying and compute you need. Use a microservice to capture data. Use Spark to implement a custom ML model for scoring. Then use Snowflake for ad-hoc querying and reporting, all on the same data.
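
For instance, here’s a sketch of two very different engines sharing one table through an AWS Glue catalog. The bucket, table name, and library versions are illustrative, and it assumes AWS credentials and region are already configured in the environment:

from pyiceberg.catalog import load_catalog
from pyspark.sql import SparkSession

# Engine 1: a lightweight Python job reads the table via Glue.
glue = load_catalog("lake", type="glue")
events = glue.load_table("analytics.events").scan().to_arrow()

# Engine 2: a Spark cluster pointed at the same catalog and files.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
            "org.apache.iceberg:iceberg-aws-bundle:1.6.1")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)
spark.sql("SELECT count(*) FROM lake.analytics.events").show()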

This separation has an added benefit: data sharing. The compute reading and writing the data doesn’t have to be only yours. You can take advantage of compute provided by other apps in your stack to write and enrich your data. You can also grant your partners or customers direct access to the data. This enables a Zero Copy / Zero ETL approach to data integrations: don’t copy data around, just reference it. (As you can imagine, we care about data sharing a lot at Census 🙂)

So when does using Iceberg make sense?

  • High Scale: If your company is dealing with petabytes of data or more, Iceberg's design for handling scale becomes very attractive.
  • Cost Optimization: For companies looking to control costs, especially for data that isn’t queried frequently, a data lake can be more cost-effective. By storing data in cloud object storage (like S3 or GCS) and only spinning up compute when needed, you can significantly reduce storage and processing costs compared to always-on data warehouse solutions.
  • Mixed-Compute Support: If you’re building datasets intended to be written or queried by a mix of compute engines and constituents, internal or even external to the company, Iceberg’s ability to work with multiple compute engines becomes a major advantage. Traditional data warehouses often lock you into their specific query engine. With Iceberg, you can optimize for your use cases, supporting high-scale analysis and real-time streaming on the same data at the same time.
  • Open Format Preference: Organizations that prefer open standards and want to avoid vendor lock-in should lean toward Iceberg. This open format allows for easier migration and integration with other tools in the future.

Achilles’ Heel: The Iceberg Catalog

Iceberg has a lot of potential wins, but it’s not perfect. In theory, all of these open standards let you store data anywhere and query it from anything. In practice today, you’re going to have a hard time. A lot of this comes down to the Catalog.

Again, this isn’t a data catalog like Atlan or Select Star. Iceberg requires a service running somewhere to act as the authority for updates: any write needs to go through it so that metadata can be updated consistently and all readers know what the current state of the table is. The Iceberg Catalog handles this.
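
Concretely, the catalog’s core job is tiny: map a table name to the table’s current metadata file, and atomically swap that pointer on every commit. Continuing the earlier PyIceberg sketch (names illustrative):

table = catalog.load_table("analytics.events")

# The pointer the catalog maintains: the current metadata file...
print(table.metadata_location)
# e.g. s3://my-bucket/warehouse/.../metadata/00042-<uuid>.metadata.json

# ...which in turn names the current snapshot every reader should see.
print(table.current_snapshot().snapshot_id)

Every write produces a new metadata file and asks the catalog to swap the pointer; that atomic swap is what keeps concurrent readers and writers consistent.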

Today, it may be easy to start storing data in Iceberg. Picking the Iceberg Catalog that makes your data maximally usable is another matter entirely. Just take a look at the catalog support matrix below.

[Table: the state of Iceberg catalog support across engines. Columns: Snowflake, Databricks, BigQuery, Starburst/Trino, Dremio, Spark, PyIceberg, and duckdb. Each engine is scored on which catalogs it can use (AWS Glue, BigLake Metastore, Hive Metastore, JDBC Catalog, REST Catalog including Tabular and Polaris, Nessie), on whether it can act as a catalog itself (two can today, via JDBC and REST respectively), and on object store support (S3, GCP, Azure, S3-compatible). Notable cells: Snowflake’s REST catalog support is Polaris only, and several engine/catalog combinations work only via Spark.]

(State of Iceberg Catalogs as of August 2024)

A few things stand out:

  • No catalog today is supported everywhere. AWS Glue is as close as it gets at the moment. (Personal favorite duckdb simply doesn’t have catalog support at all yet!)
  • Though Iceberg promised a store-anywhere world, for the most part you’re still limited to one of the big three cloud storage providers.

Expect this to be a temporary state. Though there are a number of catalogs in the mix, the ecosystem is largely gravitating toward Iceberg’s REST protocol as the standard for communicating with catalogs. It’s early days, but Tabular (now owned by Databricks) and Snowflake’s Polaris both provide REST implementations.
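
That convergence matters because, to a client, a catalog is mostly connection properties. A hypothetical PyIceberg sketch of moving from Glue to a REST catalog without touching the rest of your code:

from pyiceberg.catalog import load_catalog

# Today: AWS Glue...
catalog = load_catalog("lake", type="glue")

# ...tomorrow: any REST implementation (Tabular, Polaris, ...).
catalog = load_catalog(
    "lake",
    type="rest",
    uri="https://catalog.example.com",  # hypothetical endpoint
    token="<auth-token>",
)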

How to Get Started

Thankfully, you can get started on your Iceberg stack today.

  • If you’re starting from scratch, Fivetran’s Managed Data Lake is a great example. It can store your ingested data in the object storage of your choice. From there, you can access it from any service that supports Iceberg (asterisk: Iceberg is early, see our post on Iceberg limitations).
  • If you have a service generating data, such as analytics events or logs, you can write it into Iceberg format to make it portable (a sketch follows this list).
  • However you’re generating your data, most major cloud warehouses offer the ability to choose Iceberg and object storage as the primary storage mechanism.
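
For the second bullet, here’s a sketch of a service writing analytics events straight into an Iceberg table with PyIceberg. The catalog, namespace, and field names are all illustrative:

import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, TimestamptzType

catalog = load_catalog("lake", type="glue")  # any catalog works here

# Define the event schema and create the table through the catalog
# (assumes the "analytics" namespace already exists).
schema = Schema(
    NestedField(1, "event_type", StringType(), required=False),
    NestedField(2, "occurred_at", TimestamptzType(), required=False),
)
table = catalog.create_table("analytics.app_events", schema=schema)

# Batch events up as Arrow and append; each append commits a snapshot.
now = datetime.datetime.now(datetime.timezone.utc)
batch = pa.table({
    "event_type": pa.array(["signup", "login"], pa.string()),
    "occurred_at": pa.array([now, now], pa.timestamp("us", tz="UTC")),
})
table.append(batch)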

Over the coming weeks, we’ll be sharing more detailed articles on how Census and other data products can be used to build an Iceberg stack, so make sure you subscribe to stay up to date with the latest details.