A Deep Dive into Polaris: Simplifying Iceberg Catalog Management

Sean Lynch
12 September 2024

Last week, we took a deep dive into Snowflake’s native Managed Iceberg Tables and hit a few of the limitations of their built-in implementation. It’s a bit unfair, as they’re already onto a new approach: in June, they announced Polaris, an Apache-incubated Iceberg catalog.

If you’re new to Iceberg, take a look at the post we wrote explaining what an Iceberg catalog is and why it’s necessary. In that post, we discussed how catalogs still have a lot of rough compatibility issues. One exciting thing right off the top is that Polaris implements the new Iceberg Catalog REST API, which promises easier cross-compatibility between data consumers. We’ll test that theory here.

First, Polaris is available as a hosted service; you can sign up today if you’re a Snowflake customer (non-customers get sent to a “Coming soon” page). Otherwise, it’s an open source Apache project that you can self-host with a bit of assembly. The first commits only started landing a few months ago, so the product is still a bit sparse.

[Image: iceberg_polaris]

Polaris’s objects heavily overlap with the standard Iceberg entities, and Polaris adds an access control layer on top to manage who can do what to which resources. It’s simpler than the models in Snowflake and other warehouses, but it’s also unique to Polaris. Here are the core concepts that make up Polaris:

[Diagram: Polaris object model (iceberg_polaris_diagram)]

Catalog

The top-level container/manager of a set of tables (not views yet). Tables can be grouped into namespaces, basically the equivalent of database + schema, except that namespaces can be arbitrarily nested within namespaces. For example, a.b.c.d.e.f.g is a valid namespace. Gone is the standard three-part fully qualified name.
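As a quick illustration, here’s what nested namespaces might look like from PyIceberg. This is a sketch: catalog is a REST catalog handle like the one loaded later in this post, and the namespace names are placeholders.

# Namespaces can nest arbitrarily deep; each level is created explicitly.
# (Sketch: `catalog` is a PyIceberg REST catalog handle; names are made up.)
catalog.create_namespace(("analytics",))
catalog.create_namespace(("analytics", "events"))
catalog.create_namespace(("analytics", "events", "web"))

# A table identifier is the namespace plus the table name,
# e.g. "analytics.events.web.page_views".
print(catalog.list_namespaces(("analytics", "events")))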

In Polaris, a catalog is also associated with a set of storage credentials that can point to S3, Azure storage, or GCS. Unique to Polaris, a catalog can be one of the following two types:

  • Internal: The catalog is managed by Polaris and tables can be read and written via Polaris. Effectively the standard Iceberg catalog.
  • External: The catalog is managed by another Iceberg catalog provider, and its tables are read-only in Polaris. Currently only Snowflake can act as an external provider, but the documentation states they want to support at least Glue and Dremio Arctic too.
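To make that concrete, creating a catalog comes down to a single call against Polaris’s management API. The sketch below uses Python’s requests library; the endpoint path and payload shape are based on the Polaris management API at the time of writing and may shift, and the host, token, bucket, and IAM role are placeholders.

import requests

POLARIS_HOST = "https://<account>.snowflakecomputing.com"   # placeholder
HEADERS = {"Authorization": "Bearer <admin_token>"}          # placeholder

# Assumed management endpoint/payload; check the Polaris docs for your version.
resp = requests.post(
    f"{POLARIS_HOST}/api/management/v1/catalogs",
    headers=HEADERS,
    json={
        "catalog": {
            "name": "demo_catalog",
            "type": "INTERNAL",   # or "EXTERNAL" for a Snowflake-managed catalog
            "properties": {"default-base-location": "s3://<bucket>/<prefix>/"},
            "storageConfigInfo": {
                "storageType": "S3",
                "roleArn": "arn:aws:iam::<account-id>:role/<polaris-role>",
            },
        }
    },
)
resp.raise_for_status()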

Roles

Polaris’s access control system is built around two different types of roles:

  • Principal Roles - A service connection is a set of credentials that acts as a particular principal. Multiple connections can act as the same principal, letting you hand out unique credentials per end user, for example.
  • Catalog Roles - Each catalog can define one or more roles, where a role is a collection of privileges such as TABLE_READ_DATA, NAMESPACE_CREATE, or CATALOG_MANAGE_CONTENT (everything). One notable constraint: these privileges apply at the catalog level, not the namespace or table level.
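Wiring the two role types together takes two management API calls: grant a privilege to a catalog role, then attach that catalog role to a principal role. Here’s a sketch that assumes the catalog, catalog role, and principal role already exist; the endpoint paths and payloads are assumptions based on the Polaris management API and worth checking against the docs for your version.

import requests

POLARIS_HOST = "https://<account>.snowflakecomputing.com"   # placeholder
HEADERS = {"Authorization": "Bearer <admin_token>"}          # placeholder

# 1. Give the catalog role a privilege. Privileges apply to the whole catalog,
#    not to an individual namespace or table.
requests.put(
    f"{POLARIS_HOST}/api/management/v1/catalogs/demo_catalog"
    "/catalog-roles/demo_catalog_role/grants",
    headers=HEADERS,
    json={"grant": {"type": "catalog", "privilege": "CATALOG_MANAGE_CONTENT"}},
).raise_for_status()

# 2. Attach the catalog role to a principal role, so any connection acting as
#    that principal role inherits the privilege.
requests.put(
    f"{POLARIS_HOST}/api/management/v1/principal-roles/demo_principal_role"
    "/catalog-roles/demo_catalog",
    headers=HEADERS,
    json={"catalogRole": {"name": "demo_catalog_role"}},
).raise_for_status()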

Setting it up

Before you start, decide whether you’ll be using Snowflake to create/manage/update your Iceberg tables, or a separate process like Spark. The setup steps start the same, but there are additional steps at the end depending on where you expect to be creating and updating your Iceberg tables.

  1. Create a catalog
    • Mark it as external if you intend to have it populated by a Snowflake-managed catalog.
    • Reuse the S3 credentials you set up for the Snowflake-managed tables, but keep in mind this will generate a new S3 user, and you will need to create the trust relationship for it.
    • While you’re here, create a catalog role that grants CATALOG_MANAGE_CONTENT for now.
  2. Create a connection
    • As part of this, create a new principal role, or reuse an existing one
    • This creates a client ID and secret, which is what an external process like Spark needs; a Snowflake-managed catalog will also use these. (See the token sketch just after this list.)
    • Note that the connection types actually do have meaning: a Snowflake credential can’t simply be reused for PyIceberg; you need to generate a separate PyIceberg connection.
  3. Grant the principal role access to the catalog role.
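Under the hood, the client ID and secret from step 2 get exchanged for a bearer token through the standard Iceberg REST OAuth endpoint. Spark and PyIceberg do this for you when you pass a credential and scope, but a manual sketch makes the flow visible (the host and credentials are placeholders):

import requests

# Exchange the connection's client ID/secret for a short-lived bearer token.
# The v1/oauth/tokens path comes from the Iceberg REST catalog spec.
resp = requests.post(
    "https://<account>.snowflakecomputing.com/polaris/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": "<client_id>",
        "client_secret": "<client_secret>",
        "scope": "PRINCIPAL_ROLE:ALL",  # act under all principal roles granted to this principal
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]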

If you’re setting up an External Catalog with Snowflake Managed Tables

Presumably you’ve already set up some Snowflake-managed Iceberg tables. If not, follow our guide. You need to do one additional step: create a catalog integration that points at Polaris.


CREATE OR REPLACE CATALOG INTEGRATION your_polaris_catalog_integration
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default' -- namespace within the catalog to scope to
  REST_CONFIG = (
    CATALOG_URI = 'https://<account_identifier>.snowflakecomputing.com/polaris/api/catalog'
    WAREHOUSE = '<polaris_catalog_name>'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'         -- Polaris connection credentials
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;

Now any existing Iceberg tables will need the new CATALOG_SYNC property set to point at the integration. You can include it as a parameter when creating a table, or add it to existing ones:


ALTER ICEBERG TABLE your_db.your_schema.iceberg_table
  SET CATALOG_SYNC = 'your_polaris_catalog_integration';
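If you have more than a handful of existing tables to opt in, the ALTER is easy to script. Here’s a minimal sketch using the Snowflake Python connector; the connection parameters and table list are placeholders.

import snowflake.connector

tables = [
    "your_db.your_schema.iceberg_table_1",  # placeholders
    "your_db.your_schema.iceberg_table_2",
]

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>",
)
with conn.cursor() as cur:
    for fqn in tables:
        # Point each existing Iceberg table at the Polaris catalog integration.
        cur.execute(
            f"ALTER ICEBERG TABLE {fqn} "
            "SET CATALOG_SYNC = 'your_polaris_catalog_integration'"
        )
conn.close()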

Every new table or update will automatically sync to Polaris and be available to any reader. You’ll see them as soon as you load up Spark or PyIceberg against the REST catalog. Polaris also implements access delegation, so your Spark or PyIceberg client doesn’t even need to have the S3 credentials, making it very easy to get started.


from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "uri": "https://.snowflakecomputing.com/polaris/api/catalog",
        "credential": ":",
        "scope": "PRINCIPAL_ROLE:ALL",
        "warehouse": "",
    }
)

catalog.load_table("YOUR_DB.YOUR_SCHEMA.ICEBERG_TABLE").scan().to_arrow()

If you’re setting up an Internal Catalog

You’re ready to go. You can use Spark or PyIceberg to start writing tables!
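For example, loading the catalog with PyIceberg exactly as above, creating and writing a table might look like this (the namespace, table name, and data are placeholders):

import pyarrow as pa

# Assumes `catalog` was loaded with load_catalog(...) as shown earlier.
catalog.create_namespace(("demo",))

df = pa.table({
    "id": pa.array([1, 2, 3], pa.int64()),
    "name": pa.array(["a", "b", "c"]),
})

# Create the table from the Arrow schema, then append the data.
table = catalog.create_table("demo.events", schema=df.schema)
table.append(df)

print(table.scan().to_arrow().num_rows)  # 3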

You can also query those tables from Snowflake, though not write to them. The unfortunate part is that Snowflake’s CATALOG_SYNC doesn’t work in reverse: in addition to creating the Iceberg tables, you will need to manually create a matching ICEBERG TABLE definition in Snowflake.


CREATE ICEBERG TABLE database.schema.test_table
  CATALOG = 'demo_polaris_int'            -- catalog integration
  EXTERNAL_VOLUME = '<external_volume>'   -- external volume definition
  CATALOG_TABLE_NAME = 'test_table';      -- table name in the catalog, including namespaces

SELECT * FROM database.schema.test_table;

Polaris enters an Iceberg ecosystem that, while still early, already has a lot of angles to navigate. Obviously it works well with Snowflake, but the REST API implementation is already delivering increased compatibility across platforms like Spark and PyIceberg. What remains to be seen is how much Polaris favors Snowflake specifically, with things like catalog sync, and how quickly (or slowly) the rest of the offerings move to adopt the REST API as well.