Snowflake external functions for data enrichment with AWS example | Census

Stephen Allwright
29 September 2022

Your Snowflake instance is likely full of data that’s been collected over the years. While, on the surface, it might seem like you can’t use data from years ago, don’t dismiss it just yet – that data is useful

It can provide value to you even today, guide your key decisions and help you to understand changes in your company. But what if you could enrich it even further by incorporating code from outside of Snowflake? 📈

This is where External Functions in Snowflake come in — and that’s exactly what we'll be looking at in this article.

What are external functions in Snowflake?

An external function is a user-defined function (UDF) that calls code that’s been stored and executed outside of Snowflake. Whether that code is on AWS, GCP, or Microsoft Azure cloud platforms, these functions give you the possibility of accessing API endpoints using SQL called within Snowflake. 

Simply put, with external functions, you: 

  • Increase available possibilities within Snowflake
  • Reduce the need for data pipelines when using external endpoints

How easy is it to set up Snowflake external functions?

It sounds like external functions are great, right? So the big question is: How easy is this to actually set up? 🤔

The answer to that question depends on your starting point. 📍 If you have an existing API that you would like to make requests on then this is a relatively easy process. However, if you need to create and maintain a new API then you’ll need some software engineering expertise to help with the process (which requires significantly more time and effort).

Step-by-step example for creating external functions for customer data in Snowflake

Let’s look at a typical use case for using external functions.

In this tutorial, we’ll create an external function to enrich sales leads with a ranking score created by an existing machine learning model API. The end result? A much more valuable dataset that you can utilize in an external tool to better prioritize sales leads. 🔥

What you need to get started 

To complete this tutorial there are a few items that you will need to have in place 👇

  • An existing API endpoint in a cloud platform of your choice (AWS, GCP, or Azure)
  • Account admin role in Snowflake

For this example, we’ll assume that we’re creating this integration with an API endpoint in AWS.

Integrate a machine learning API with Snowflake 

First, we’ll create an API integration object between Snowflake and the API so we can use our external machine learning model’s API in the external function.

To do this, we use the command CREATE OR REPLACE API INTEGRATION in our account admin role:

CREATE OR REPLACE API INTEGRATION <api_integration_name>
  api_provider = <api_provider_name>
  api_aws_role_arn = <IAM_role_ARN>
  api_allowed_prefixes = <resource_url>
  enabled = true
;

CREATE OR REPLACE API INTEGRATION leads_score_api_integration
  api_provider = aws_api_gateway
  api_aws_role_arn = '<IAM_role_ARN>'
  api_allowed_prefixes = ('https://…')
  enabled = true
;

The specific fields for this command will change slightly depending on the cloud platform that you’re using, but for our AWS example, the fields here are:

  • <api_integration_name> - The name we want to give to our API integration
  • api_provider - The object name of the API provider we’re using (there are default object names for each platform)
  • api_aws_role_arn - The Amazon Resource Name (ARN) for our Identity and Access Management (IAM) role
  • api_allowed_prefixes - The resource invocation URL

Authenticate Snowflake with your cloud platform

Now that the integration object between Snowflake and the API has been created, we need to authenticate that usage so that the request will be trusted. 

You can do this in your IAM role within the cloud platform GUI. The details are found by running the command DESCRIBE INTEGRATION.

DESCRIBE INTEGRATION leads_score_api_integration;

For AWS, the fields returned by DESCRIBE INTEGRATION that we will need to create a trusted relationship are:

  • API_AWS_IAM_USER_ARN
  • API_AWS_EXTERNAL_ID

Create an external function object in Snowflake

Since the integration and trusted relationship between our Snowflake instance and the API endpoint have been created, we can use this integration object to create our external function that will enrich our sales leads data. 

To do so, we’ll run the command CREATE EXTERNAL FUNCTION with details of the incoming and outgoing data.

CREATE EXTERNAL FUNCTION <name>(<arg_name> <arg_data_type>)
    RETURNS <result_data_type>
    api_integration = <api_integration_name>
    AS <url_of_proxy_and_resource>
    ;

CREATE EXTERNAL FUNCTION leads_score(number_of_days_since_signup integer, number_of_visits integer, total_time_spent float)
    RETURNS variant
    api_integration = leads_score_api_integration
    AS '<url_of_proxy_and_resource>'
    ;

The fields in this command that we need to fill in are:

  • <name> - The name we want to give to our external function
  • <arg_name> and <arg_data_type> - The external function’s input arguments and corresponding data types
  • <result_data_type> - The data type that we can expect to receive from the model
  • <api_integration_name> - The name of the integration that we created earlier
  • <url_of_proxy_and_resource> - The invocation URL of the proxy service that Snowflake uses to call the model

Use Snowflake external functions to enrich data in a view or table

Now, we can use the external function object that we created in the previous section to enrich data in our instance. To put the function to work, we can use it as part of a CREATE statement for a view, a table, or a simple ad-hoc select statement. 

The syntax for using the external function in a query is similar to calling a function in other programming languages. Let’s look at a simple select statement example where we want to score our sales leads.

SELECT lead, leads_score(number_of_days_since_signup, number_of_visits, total_time_spent) as lead_score FROM sales_leads;

Incorporating this function into a create statement will look very similar – we’ll simply create a new view with our newly scored sales leads.

CREATE OR REPLACE VIEW scored_sales_leads AS 
SELECT lead, leads_score(number_of_days_since_signup, number_of_visits, total_time_spent) as lead_score FROM sales_leads;

Parsing the JSON object from the remote service into Snowflake

While the process of using our new external function may seem simple on the surface, Snowflake is doing some intense work behind the scenes that we can’t see. In order to do this work though, the returning data needs to be in the correct format.

When our leads_score external function is run, the remote service returns a JSON object that is parsed by Snowflake. For Snowflake to be able to interpret that data correctly behind the scenes, it should be in the following format:

{
"statusCode": <http_status_code>,
"body":
        {
            "data":
                  [
                      [ 0, “High value” ],
                      [ 1, “Medium value” ],

                      [ 2, “Low value” ],
                      ...
                  ]
        }
}

Let’s walk through each part of this data structure:

  • statusCode - The status of the request
  • body - The only element in the body is the data object
  • data - This is the data that is parsed by Snowflake. It should contain one row for each row sent by Snowflake and contain:
    • The row number as a 0-based index
    • The values returned by the function, which in our case is the ranked score of the sales lead

Advantages and disadvantages of using external functions

There’s no doubt that using external functions in your Snowflake instance will bring extra value. The key advantages of using them – which we’ve also seen in this tutorial – are

  1. Data enrichment directly in Snowflake without the need for external data pipelines
  2. The code for the remote service can be written in the programming language you wish
  3. The external code can use functions and libraries that are not accessible natively within Snowflake

While there are some tempting upsides, there are also a number of disadvantages that should be considered when using external functions, including:

  1. Time and opportunity-cost required to create and maintain the integration with the API (as well as the API itself)
  2. Snowflake has no guarantees regarding frequency and duplications when calling external functions
  3. The remote service has to be hosted on either AWS, Azure, or GCP
  4. Each cloud platform has specific requirements for its integrations

Send enriched Snowflake data to CRM tools with the help of Census

In this tutorial, we were able to enrich our sales data with an external machine learning model to better prioritize our sales leads. Although this is helpful, the real value comes from using this data in the tools that our sales professionals use to contact these leads. To do this, you need a tool that can easily and securely move this data. This is where Census comes in. 💪

Census is the leading reverse ETL tool available, allowing you to move data from Snowflake into all of your tools without needing to spend time setting up and maintaining these integrations. Instead, you can focus your time and resources on what matters: Getting value from and taking action on your data.

👉 Want to start syncing your data? Book a demo with one of our product specialists to learn how we work for your specific operational analytics needs.