How to query Iceberg locally using Spark, PyIceberg, or duckdb

Sean Lynch
8 October 2024

A surprising amount of Iceberg content today is written assuming you’re reading and writing files on disk. That really misses the value of Iceberg, which is specifically designed to enable a distributed network of readers and writers. If you’re simply running on one machine, stick to duckdb or SQLite and save yourself the pain!

But when you’re getting started with Iceberg data in the cloud and behind catalogs, it’s still really helpful to get your queries running locally. Thankfully, working from a laptop is straightforward.

Getting an Iceberg client running locally requires thinking about three things:

  • What kind of catalog does your target Iceberg use?
  • Where is the underlying data stored and what credentials do you need to access it?
  • Do you plan to read only or do you also need to write?

There are a few different tools that can be used as Iceberg clients, and the answers to these questions will help determine which of them are an option for you.

Runtime   | Language      | Query Support | Catalog Support | Write/DDL Support
----------|---------------|---------------|-----------------|------------------
Spark     | Java + Python | SQL           | Everything      | Yes
PyIceberg | Python        | None          | Most            | Yes
duckdb    | Python        | SQL           | Limited         | No

We’ll go through each option below. Note that installing runtimes and packages differs by operating system, and we won’t go into that detail here.

Spark

Catalog Support      | Storage Support            | Write/DDL Support
---------------------|----------------------------|------------------
Basically everything | Local + S3, GCS, and Azure | Yes

The Spark project is in many ways responsible for the success of Iceberg: having a capable, scalable, open source compute project drives the need for a correspondingly scalable and open storage standard. If you’re a Databricks user, chances are you’re already familiar with Spark, and the fastest way to get access to an Iceberg client is to simply sign up for Databricks Community Edition. But you can also run it locally.

Though you can run Spark in Java or Scala, the most common way to run Spark locally is PySpark, a Python wrapper around Java-based Spark. Even though you’ll be working in Python, you’ll still need a Java runtime installed locally.

Setting up

  1. Confirm that the java command works, and install Java if it’s missing.
  2. Then install PySpark (preferably within a virtual environment):
pip install "pyspark[sql]"

 

Querying

Your Python code can be run as a script or through the Python REPL. Here’s what a connection looks like (in this case, to a Polaris Iceberg catalog using the Iceberg REST protocol). All the configuration is a bit daunting, but that’s also why Spark can claim the best support.

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab') \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.iceberg.vectorization.enabled', 'false') \
    .config('spark.sql.defaultCatalog', 'mycatalog') \
    .config('spark.sql.catalog.mycatalog', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.mycatalog.type', 'rest') \
    .config('spark.sql.catalog.mycatalog.uri','https://abc123456.us-east-1.snowflakecomputing.com/polaris/api/catalog') \
    .config('spark.sql.catalog.mycatalog.credential','CLIENT_ID:CLIENT_SECRET') \
    .config('spark.sql.catalog.mycatalog.warehouse','catalog_name_in_polaris') \
    .config('spark.sql.catalog.mycatalog.scope','PRINCIPAL_ROLE:ALL') \
    .config('spark.sql.catalog.mycatalog.header.X-Iceberg-Access-Delegation','vended-credentials') \
    .getOrCreate()

There’s a lot here, and your connection will likely look different, but let’s explain a bit about what all of these settings mean.

  • spark.jars.packages specifies the Java packages needed to support your particular connection. Different catalogs and object stores will require different combinations of packages. In this case, my Iceberg data is stored in S3, so I need to pull in the AWS SDK bundle as well. Spark will take care of downloading the jars for me.
  • spark.sql.extensions adds some commands to SparkSQL for interacting with Iceberg.
  • Spark lets me “mount” different catalogs wherever I want in the SQL namespace when setting up a session. In this case, I’m configuring a catalog called mycatalog and also making it the default with spark.sql.defaultCatalog. This means that when I query tables in that catalog, I can skip the top-level catalog name.
  • My new catalog is a SparkCatalog of type rest, so Spark is going to use the Iceberg REST protocol to talk to it. I also provide a uri and some credentials that are specific to this type of catalog; others might use OAuth, for example.
  • There’s also some catalog-specific configuration. It’s likely that every catalog will have some quirks of its own to navigate. For example:
    • The catalog I’m connecting to uses Snowflake’s managed Iceberg tables to create tables, and those write data using Parquet V2. As a result, I need to disable an Iceberg feature to keep Spark happy by setting spark.sql.iceberg.vectorization.enabled to false.
    • Polaris supports Access Delegation via the spark.sql.catalog.mycatalog.header.X-Iceberg-Access-Delegation: vended-credentials setting. In this case, the catalog will hand me the credentials I need to talk to object storage without any additional configuration on my side.
  • If I couldn’t use Access Delegation, I’d need to provide S3 (or any other object store) credentials myself:
.config('spark.sql.catalog.mycatalog.io-impl','org.apache.iceberg.aws.s3.S3FileIO') \
.config('spark.sql.catalog.mycatalog.s3.access-key-id','<aws_access_key_id>') \
.config('spark.sql.catalog.mycatalog.s3.secret-access-key','<aws_secret_access_key>') \
.config('spark.driver.extraJavaOptions','-Daws.region=us-east-1') \

Getting the correct connection incantation is absolutely the hardest part. And it’s possible you may not see all of the permissions issues until you’ve started querying. But once you’ve got your connection, you’re ready to start writing SQL. Start with some simple ones:

spark.sql("SHOW NAMESPACES").show()
spark.sql("SELECT 1").show()

One of the main benefits of Spark is you can write data and perform other DDL actions. All of these should work as well:

spark.sql("CREATE NAMESPACE spark_created").show()
spark.sql("CREATE TABLE spark_created.trivial AS SELECT 1 AS id;").show()
spark.sql("SELECT * FROM spark_created.trivial;").show()
spark.sql("CREATE VIEW spark_created.trivial_view AS SELECT 1 AS id;").show()

 

PyIceberg

Catalog Support  | Storage Support            | Write/DDL Support
-----------------|----------------------------|-----------------------------------------------
Many but not all | Local + S3, GCS, and Azure | Mostly yes (some operations not yet supported)

PyIceberg is a direct Python implementation for managing Iceberg tables, so it has no dependency on Java at all, which is nice. The downside is that it doesn’t support a SQL dialect, at least not directly.

Unlike PySpark, you’ll need to manage installing the appropriate extras yourself when calling pip. In this case, I’ll need s3fs to access the object storage and pyarrow to work with the data in memory.

pip install "pyiceberg[pyarrow,s3fs]"

The connection is a little more straightforward. You don’t “mount” catalogs the way Spark does; instead, you have standalone catalogs.

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "uri": "<https://abc123456.us-east-1.snowflakecomputing.com/polaris/api/catalog>",
        "credential": "CLIENT_ID:CLIENT_SECRET",
        "scope": "PRINCIPAL_ROLE:ALL",
        "warehouse": "snowflake_managed",
    }
)
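A quick way to confirm the catalog connection actually works is to ask it what it can see. This is a minimal sketch using PyIceberg’s catalog listing methods; the namespace here is just the Snowflake-style one used in the next example and may differ in your catalog:

# List the namespaces the catalog exposes, then the tables in one of them.
print(catalog.list_namespaces())
print(catalog.list_tables("SNWFLK_DATABASE.MY_SCHEMA"))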

The catalog connection is simpler for a few reasons: there’s no need to manage Java packages here (that was handled during the pip step), and some of the Polaris-specific configuration is either not needed or handled automatically by PyIceberg.

As mentioned, PyIceberg doesn’t do the querying itself. Instead, you use it to pull tables from Iceberg and load them into memory. In this case, we’re loading into Arrow, which is optimized for working with a lot of data in memory.

results = catalog.load_table("SNWFLK_DATABASE.MY_SCHEMA.TABLE_NAME")
                 .scan(limit=100).to_arrow()
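Scans can also push down filters and column projections before anything is materialized, which matters once tables stop fitting comfortably in memory. A hedged sketch reusing the catalog from above; the column names are hypothetical:

table = catalog.load_table("SNWFLK_DATABASE.MY_SCHEMA.TABLE_NAME")
results = table.scan(
    row_filter="id > 100",           # hypothetical column
    selected_fields=("id", "name"),  # hypothetical columns
    limit=1000,
).to_arrow()
print(results.num_rows)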

If you do want to do some querying though, there’s another great in-memory representation you can use: duckdb. More on using them together below.

duckdb

Catalog Support                       | Storage Support                                            | Write/DDL Support
--------------------------------------|------------------------------------------------------------|------------------
Hadoop/filesystem-style catalogs only | Local + S3 (maybe Azure and GCS, as well as S3-compatible) | No

I’m a huge fan of duckdb as a tool for a wide variety of data analysis. Unfortunately, it comes up a bit short on Iceberg support at the moment: the iceberg extension is read-only and doesn’t talk to catalogs, so you have to point it directly at a table’s metadata files.

If you don’t already have duckdb installed, you can install it with pip and then launch the CLI:

pip install duckdb
duckdb

Now, within the duckdb session, install and load the httpfs and iceberg extensions:

INSTALL httpfs;
LOAD httpfs;
INSTALL iceberg;
LOAD iceberg;

You’ll also need to set up credentials (see the duckdb docs) to talk to the underlying object storage:

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    REGION 'us-east-1'
);

Finally, you’ll need to find the path to the current metadata file for your target table. For example, in Snowflake, you can use the following command to return the path to the underlying storage of a Snowflake Managed Iceberg Table:

SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('SNWFLK_DATABASE.MY_SCHEMA.TABLE_NAME');

Once you’ve got all that, you should be ready to query it!

SELECT COUNT(*) FROM ICEBERG_SCAN('s3://my-s3-bucket/path/to/table_name/metadata/00001-1df49ecf-b27e-41b8-ae65-188bc1a24416.metadata.json');
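If you’d rather drive all of this from Python than from the duckdb CLI, the same steps work through duckdb’s Python API. A sketch using the same placeholder credentials and metadata path as above:

import duckdb

con = duckdb.connect()
# Install and load the same extensions as in the CLI session.
for ext in ("httpfs", "iceberg"):
    con.install_extension(ext)
    con.load_extension(ext)

# Same placeholder S3 credentials as the CREATE SECRET example above.
con.sql("""
    CREATE SECRET secret1 (
        TYPE S3,
        KEY_ID 'AKIAIOSFODNN7EXAMPLE',
        SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
        REGION 'us-east-1'
    )
""")

con.sql("""
    SELECT COUNT(*)
    FROM ICEBERG_SCAN('s3://my-s3-bucket/path/to/table_name/metadata/00001-1df49ecf-b27e-41b8-ae65-188bc1a24416.metadata.json')
""").show()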

 

PyIceberg + duckdb

The reality is these two should be used in combination with each other. In the future, you may only need one or the other, but today, PyIceberg can provide the catalog interface and duckdb the query engine.

The first step is to have PyIceberg talk to the catalog and then return a duckdb connection instance. You’ll need to install duckdb alongside pyiceberg to do that:

pip install "pyiceberg[duckdb]"
# load a pyiceberg catalog as before

conn = catalog.load_table("NAMESPACE.TABLE_NAME").scan().to_duckdb(table_name="table_name")
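The scan is registered on the returned connection under the table_name you passed in, so ordinary SQL works against it immediately (a small sketch):

# Query the view that to_duckdb registered on the connection.
print(conn.execute("SELECT COUNT(*) FROM table_name").fetchone())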

All of duckdb’s querying is available just as in the previous section. You can even use this same tag-team approach in reverse: take the results of a query and write them back via PyIceberg.

# Pull the query result out of duckdb as an Arrow table...
results = conn.execute("SELECT * FROM updated_table").arrow()
# ...and write it back to the Iceberg table via PyIceberg.
catalog.load_table("NAMESPACE.UPDATED_TABLE").overwrite(results)

 

Wrapping Up

This should get you going with any of the most common Iceberg clients. If you’re having issues, though, don’t be too surprised: this ecosystem is evolving very rapidly. Even over the course of writing this, I found the teams at Databricks and Snowflake rapidly fixing bugs I ran into. If you’ve searched a few resources and haven’t been able to resolve the problem, plan to report it to the catalog or database owner (provide them a clear, narrow reproduction to make it actionable). And hopefully, in the near future, there won’t be a need for an article like this.