Why we chose Iceberg as the foundation for Census Store

Ellen Perfect
11 March 2025

This week, we launched Census Store, a new way of using Census that lets you build and transform datasets without setting up a warehouse. We did this because we believe:

  • Iceberg is affordable enough to democratize access to best-in-class data management strategies
  • Iceberg is flexible enough to provide a valuable, queryable catalog for companies that already use a data warehouse and have their own object storage
  • Affordable storage frees up resources for more complex compute, and with AI becoming more important to businesses than ever, that matters

What is Iceberg Storage?

Apache Iceberg is an open-source table format designed for big data stored in cloud or distributed storage like Amazon S3. It allows you to manage huge datasets efficiently while keeping track of changes (like updates, deletes, and inserts) in a structured way.

There are three components that make Iceberg storage work:

  • Object Storage: Most commonly, this is an Amazon S3 bucket, an Azure Blob Storage container, or a Google Cloud Storage bucket. It has a flat structure: files are stored under keys that look like directory paths, with no real hierarchy or relationships between them. It excels at storing raw data in any format (Parquet, JSON, CSV, images, etc.) affordably, but sacrifices easy organization.
  • REST Catalog: This is essentially the table of contents for your data. It keeps track of which tables exist, which files belong to each table, and where those files live, all in a structured, queryable way. It can be used with any tool that speaks the Iceberg REST catalog API (like Trino, Spark, or Snowflake).
  • Query engine: You don't need a data warehouse to use Iceberg storage, but most companies will want a place to build and manage datasets so they can act on the data they store. Connecting to the REST catalog gives the query engine structured data to query.
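
To make the division of labor concrete, here is a minimal Python sketch of the first two components. It is a toy model, not Iceberg's actual API: the dictionaries, the table names, and the file paths are all made up for illustration, and a real catalog points at a table metadata file rather than holding the file list itself.

```python
# Toy model only: every name and path below is hypothetical.

# Object storage: keys mapped to bytes, with no notion of tables.
object_store = {
    "data/orders/part-0.parquet": b"...",
    "data/orders/part-1.parquet": b"...",
    "data/customers/part-0.parquet": b"...",
    "tmp/orphaned-upload.parquet": b"...",
}

# Catalog: table name -> schema plus the exact files that make up the table.
catalog = {
    "orders": {
        "schema": ["order_id", "customer_id", "amount"],
        "files": ["data/orders/part-0.parquet", "data/orders/part-1.parquet"],
    },
    "customers": {
        "schema": ["customer_id", "name"],
        "files": ["data/customers/part-0.parquet"],
    },
}


def files_for_table(table_name):
    """Return the object keys a query engine should read for a table."""
    return catalog[table_name]["files"]


orders_files = files_for_table("orders")
print(orders_files)
```

The query engine asks the catalog which files to read and never has to list the whole bucket, so the stray upload under `tmp/` is simply ignored.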

Why not cut out the middleman and connect the warehouse directly to the object storage?

1. Object Storage Doesn't Organize Data Like a Database

  • S3 (or any object storage) is just a bucket of files—it doesn’t know anything about tables, columns, or versions.
  • A data warehouse needs metadata to understand how to interpret the data files.
  • Without something like Iceberg's catalog, the warehouse would have to scan all the files every time you run a query, which is slow and expensive.

2. No Transaction Management

  • If multiple users update or delete data at the same time, there's no built-in way to prevent conflicts.
  • Iceberg provides ACID transactions: each write is committed atomically as a new table snapshot, so reads and writes don’t interfere with each other.
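
One way to picture conflict prevention is an optimistic, compare-and-swap commit: a writer's commit only succeeds if the table has not changed since that writer last read it. The sketch below is a simplified illustration of that idea in plain Python. `ToyCatalog` and `CommitConflict` are hypothetical names, not part of any Iceberg library.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical stand-in for a catalog holding one table's state."""

    def __init__(self):
        self.current_version = 0
        self.files = ()

    def commit(self, expected_version, new_files):
        # Compare-and-swap: succeed only if nobody committed since we read.
        if expected_version != self.current_version:
            raise CommitConflict("table changed; re-read and retry")
        self.files = tuple(new_files)
        self.current_version += 1
        return self.current_version


catalog = ToyCatalog()

# Two writers read the same starting version.
version_seen_by_a = catalog.current_version
version_seen_by_b = catalog.current_version

# Writer A commits first and wins.
catalog.commit(version_seen_by_a, ["a.parquet"])

# Writer B's commit is based on a stale version, so it is rejected.
try:
    catalog.commit(version_seen_by_b, ["b.parquet"])
    b_conflicted = False
except CommitConflict:
    b_conflicted = True

# Writer B re-reads the table and retries on top of A's changes.
catalog.commit(catalog.current_version, list(catalog.files) + ["b.parquet"])
```

Neither writer's work is silently overwritten: the second writer is told about the conflict and retries against the latest state.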

3. No Schema Evolution

  • If your schema changes (e.g., adding a new column), object storage doesn't track those changes.
  • Iceberg records schema changes in table metadata and tracks each column by a unique ID, so you can add, rename, or drop columns without rewriting existing data files.
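
Iceberg identifies columns by stable field IDs rather than by name, which is what lets older files stay readable after a schema change. Here is a rough Python sketch of that idea. The schemas, rows, and `read_rows` helper are invented for illustration and do not reflect how Iceberg stores data on disk.

```python
# Schemas map stable field IDs to the current column names (hypothetical data).
schema_v1 = {1: "order_id", 2: "amount"}
schema_v2 = {1: "order_id", 2: "total", 3: "currency"}  # rename field 2, add field 3

# Rows as stored in files, keyed by field ID rather than by column name.
old_file_rows = [{1: 100, 2: 9.5}]             # written under schema_v1
new_file_rows = [{1: 101, 2: 20.0, 3: "USD"}]  # written under schema_v2


def read_rows(rows, schema):
    """Project stored rows onto a schema; fields a file lacks become None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]


result = read_rows(old_file_rows + new_file_rows, schema_v2)
print(result)
```

The old file is never rewritten: its `amount` values show up under the new name `total`, and the column it never had reads as empty.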

4. Slower Query Performance

  • Without Iceberg, querying S3 directly often means scanning every file.
  • Iceberg keeps partition values and column statistics (such as min and max values) for each data file in its metadata, so query engines can prune partitions and scan only the relevant files.
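
As a rough illustration of file pruning, the sketch below keeps a min and max date for each file and skips any file whose range cannot overlap the query. The file names, the `order_date` column, and the `files_to_scan` helper are assumptions made up for this example.

```python
from datetime import date

# Hypothetical per-file metadata: min/max of an order_date column in each file.
file_stats = [
    {"path": "part-0.parquet", "min": date(2025, 1, 1), "max": date(2025, 1, 31)},
    {"path": "part-1.parquet", "min": date(2025, 2, 1), "max": date(2025, 2, 28)},
    {"path": "part-2.parquet", "min": date(2025, 3, 1), "max": date(2025, 3, 31)},
]


def files_to_scan(stats, start, end):
    """Keep files whose [min, max] range overlaps the query range [start, end]."""
    return [f["path"] for f in stats if f["max"] >= start and f["min"] <= end]


# Query: orders between 10 February and 5 March 2025.
selected = files_to_scan(file_stats, date(2025, 2, 10), date(2025, 3, 5))
print(selected)
```

The January file is ruled out from metadata alone, without ever being opened.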

5. No Time Travel or Version Control

  • If you store raw Parquet files in S3, you can’t query past versions of your data.
  • Iceberg keeps snapshots, allowing you to time travel back to previous versions.
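
Snapshots can be thought of as an append-only log, where every commit records the complete set of files that made up the table at that moment. The following Python sketch models that under simplifying assumptions. The `commit` and `files_as_of` functions and the file names are hypothetical, not Iceberg's API.

```python
# Append-only snapshot log (toy model): each entry lists the table's files.
snapshots = []


def commit(files):
    """Record a new snapshot and return its ID."""
    snapshot_id = len(snapshots) + 1
    snapshots.append({"id": snapshot_id, "files": tuple(files)})
    return snapshot_id


def files_as_of(snapshot_id=None):
    """Return the files for a given snapshot, or the latest when none is given."""
    if snapshot_id is None:
        return snapshots[-1]["files"]
    for snap in snapshots:
        if snap["id"] == snapshot_id:
            return snap["files"]
    raise KeyError(f"unknown snapshot {snapshot_id}")


first = commit(["a.parquet"])
second = commit(["a.parquet", "b.parquet"])
third = commit(["b.parquet"])  # a.parquet is logically deleted

print(files_as_of())        # current state
print(files_as_of(first))   # the table as it was after the first commit
```

Because older snapshots still reference their files, reading the table as of `first` returns the data that has since been deleted from the current version.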

The true benefits of Iceberg

  • Fast and Efficient: It avoids scanning unnecessary data, making queries faster.
  • Time Travel & Snapshots: You can roll back to older versions of data.
  • Schema Evolution: You can change table structure without breaking old data.
  • Works Across Engines: Any system supporting Iceberg can read the data (e.g., Spark, Trino, Snowflake).
  • Cost-Effective: It reduces data duplication and unnecessary file rewrites.

There you have it! We're excited about the future of affordable data storage. Read more about how we're bringing this to life in our product in the Census Store docs.