Using Managed Iceberg Tables to Build a Datalake Inside Snowflake

Sean Lynch
12 September 2024

Snowflake now offers Managed Iceberg Tables, which let you build up an Iceberg datalake natively from within Snowflake.

Snowflake adds a new object called an EXTERNAL VOLUME specifically for interacting with Iceberg tables in object storage. It can be assigned at the Account, Database, or Schema tier. As part of this, you’ll need to set up the IAM controls that give Snowflake access to the storage (Snowflake’s instructions).

Now you can create the external volume (using S3 as the example, but Azure/GCP are similar):


CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'my-s3-us-west-2'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3:///'
            STORAGE_AWS_ROLE_ARN = ''
            STORAGE_AWS_EXTERNAL_ID = 'iceberg_table_external_id'
         )
      )
   ALLOW_WRITES = TRUE;

At least in the S3 case, you’ll need to run DESC EXTERNAL VOLUME on your new volume to get a few details (the IAM user ARN and external ID that Snowflake generates) that you then add to your IAM role’s trust policy.
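
Something like the following should surface those values (a sketch using the volume created above; the exact output layout may differ slightly in your account):

DESC EXTERNAL VOLUME iceberg_external_volume;
-- The STORAGE_LOCATION_1 property holds a JSON blob that includes
-- STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID, which go into
-- the trust policy of the IAM role you created for Snowflake.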

Now you can CREATE ICEBERG TABLE to create new Snowflake-managed Iceberg tables in the external volume.


CREATE OR REPLACE ICEBERG TABLE my_iceberg_table (
    boolean_col boolean,
    ...
)
CATALOG = 'SNOWFLAKE' -- Snowflake itself will act as the catalog
EXTERNAL_VOLUME = 'iceberg_external_volume' -- Unnecessary if you set a default on the account/database/schema
BASE_LOCATION = 'my/relative/path/from/extvol';

There’s no direct way to convert an existing table, so you’ll need to copy/replace if you’re doing that. Keep in mind it may require you to change column types. For example, I had to write a script to replace my TIMESTAMP_TZ(9) columns (Snowflake’s standard precision) with TIMESTAMP_NTZ(6).
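
As a rough sketch of that copy approach (my_existing_table and its columns are hypothetical; the cast is the part that matters):

CREATE OR REPLACE ICEBERG TABLE my_iceberg_copy
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'iceberg_external_volume'
    BASE_LOCATION = 'my/relative/path/from/extvol'
    AS SELECT
        id,
        -- Iceberg doesn't support nanosecond timestamps, so cast down to microsecond precision
        created_at::TIMESTAMP_NTZ(6) AS created_at
    FROM my_existing_table;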

Snowflake-managed Iceberg tables also get a Snowflake-managed Iceberg Catalog with no additional work, but it’s limited relative to other catalog implementations:

  • External services can only read. No writing.
  • It only supports JDBC connections, so Spark environments and Databricks can connect.
  • Services that don’t support JDBC are out of luck. DuckDB doesn’t support Iceberg catalogs at all at the moment, but it can read table files directly from object storage (duckdb). You can use Snowflake’s SYSTEM$GET_ICEBERG_TABLE_INFORMATION function to pull the current metadata file path (see the example after this list).
  • Snowflake itself cannot connect to it, at least not directly. A second Snowflake account can currently only consume the object storage, and you’ll need to implement scanning for metadata changes yourself.
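
As referenced above, the metadata lookup looks roughly like this (using the table created earlier; treat the exact output shape as approximate):

SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('my_iceberg_table');
-- Returns JSON including a metadataLocation field that points at the current
-- metadata file in the external volume, which a reader like duckdb can consume directly.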


If that’s too limited for your use, Snowflake’s managed catalog does have one unique capability, CATALOG_SYNC, which syncs a Snowflake-managed Iceberg table with Polaris (https://docs.snowflake.com/en/user-guide/tables-iceberg-polaris-sync): “When you modify the table in Snowflake, the changes are automatically synchronized with the external catalog in your Polaris account.” Polaris itself supports the Iceberg REST protocol, opening that catalog up to more consumers, and it provides a path for other Snowflake accounts to connect.
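
The table-side setup is roughly the following, assuming you’ve already created a catalog integration for your Polaris account (the integration name here, polaris_catalog_int, is hypothetical):

CREATE OR REPLACE ICEBERG TABLE my_synced_iceberg_table (
    boolean_col boolean
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_external_volume'
BASE_LOCATION = 'my/relative/path/from/extvol'
CATALOG_SYNC = 'polaris_catalog_int'; -- hypothetical Polaris catalog integration name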

Snowflake’s managed catalog is only a year old, but even so, my guess is that it ends up being subsumed into Polaris.

So in the next post, we’ll take a look at what Polaris brings to the table for prospective Iceberg architectures.