If you’ve been around the Amazon Redshift block a time or two, you’re probably familiar with Redshift’s COPY command. Well, allow us to introduce you to its partner in crime: the UNLOAD command. 🦹♂️ Redshift’s UNLOAD command allows Redshift users to export data from a SQL query run in the data warehouse into an Amazon S3 bucket – essentially doing the reverse of the COPY command. ◀️
In this article, we’ve captured SQL, CLI, and SDK examples so you’re totally aware of what data unloading options are available to you (and what you need to pay extra-close attention to). Welcome to your complete Redshift data unloading guide. 👐
Why would you want to unload data from Redshift?
Before running your first UNLOAD command, consider your goal. Why do you want to unload your data from Redshift? Without a clear goal in mind, you’re susceptible to inefficiency traps that might delay your data operations.
Maybe you want your data in other business tools
Do you want to move your data from Redshift into other business apps (like Salesforce or HubSpot)? Within your organization, different business teams have different needs and expertise, so some teams might live in Excel while others lean on a CRM.
Regardless of the different types of software used across your org, data is at the heart of all your business operations. ❤️ It plays a critical role across all your teams – empowering them to do their best work from a single source of truth.
Because most software that utilizes your data has a feature to import specific file types, you can use the UNLOAD query to export data out of Redshift in CSV or JSON (or other formats as needed). And, if you give your business users access to the specified buckets, you can cut yourself out as the middle man and give them self-serve access to the data they need, when they need it. 💡
Or you want to migrate it into another data storage service
Like many companies, you probably use multiple data storage services, like a relational database, a data warehouse, and a data lake. When you run the UNLOAD query, you can generate data files and use them to move data across different storage destinations – even in different cloud environments. ☁️
I mean, who doesn’t love flexible data? Better yet, you can use this unloading feature as part of your ETL pipeline to let other AWS services consume the exported data.
Either way, you should back up your data
As a best practice, before you run a data manipulation query against a Redshift table, you should always create a backup to prevent data loss (just in case). Needless to say, there are a number of ways to back up data in Redshift, but unloading data into files can be a secure and simple backup option. 🔒
When you unload data into an S3 bucket, you store that data in an isolated external location where it can then be used as a table snapshot.
What are some Redshift-specific methods to unload your data?
Redshift users can unload data in two main ways:
- Using the SQL UNLOAD command
- Downloading the query result from a client
UNLOAD SQL
The most convenient way of unloading data from Redshift is by using the UNLOAD command in a SQL IDE. You can open Redshift’s native editor by going to https://console.aws.amazon.com/redshift/.
This method is particularly effective when you want to export really large data sets because you can define various parameters in the syntax to control the export operation. The sample code below shows a simple example.
UNLOAD ('select * from students')
TO 's3://your-bucket/students/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
CSV;
The first parameter of the UNLOAD command is the SELECT query that you’ll use to define what data you’re exporting. Psstt… 👂 Don’t forget your parentheses and quotes!
You can add WHERE conditions or JOIN with other tables just like your normal SQL queries, but the TO parameter is critical; this is where you define the target S3 bucket where you want to unload the query result data into.
As a prerequisite, the IAM role you reference needs write permission on the target S3 bucket. For a CSV export, you can simply add the CSV option.
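For example, here’s a sketch of a filtered export. The column names and the enrolled_at filter are hypothetical, but the quoting rule is real: because the whole SELECT is passed as a string literal, any single quotes inside it have to be doubled up.

UNLOAD ('SELECT id, name, grade FROM students WHERE enrolled_at >= ''2022-01-01''')
TO 's3://your-bucket/students_2022/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
CSV
HEADER;

The HEADER option adds a first row with the column names, which most downstream tools expect when importing a CSV.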
But your customization preferences don’t end there! Redshift provides many more options to manage your data export, as shown below.
UNLOAD ('SELECT * FROM ...')
TO 's3://bucket/prefix'
authorization
[ option [ ... ] ]
where option is
{ [ FORMAT [ AS ] ] CSV | PARQUET
| PARTITION BY ( column_name [, ... ] ) [ INCLUDE ]
| MANIFEST [ VERBOSE ]
| HEADER
| DELIMITER [ AS ] 'delimiter-char'
| FIXEDWIDTH [ AS ] 'fixedwidth-spec'
| ENCRYPTED [ AUTO ]
| BZIP2
| GZIP
| ZSTD
| ADDQUOTES
| NULL [ AS ] 'null-string'
| ESCAPE
| ALLOWOVERWRITE
| CLEANPATH
| PARALLEL [ { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE [AS] max-size [ MB | GB ]
| REGION [AS] 'aws-region' }
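To illustrate, here’s a hedged sketch that combines a few of these options: Parquet output partitioned by a (hypothetical) graduation_year column, overwriting any files from a previous run.

UNLOAD ('SELECT * FROM students')
TO 's3://your-bucket/students_by_year/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (graduation_year)
ALLOWOVERWRITE;

Each partition lands in its own graduation_year=... folder under the prefix, which is handy if another query engine will read the files later.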
AWS SDK
AWS provides SDKs for most major programming languages so users can programmatically interact with AWS resources. Using the SDK, you can also run SQL statements against Redshift, including UNLOAD, COPY, DML, and DQL queries.
For example, you can execute a select query, get the result as a DataFrame, transform it, and then export it to an S3 bucket or your local storage. To give you an idea of the syntax, we’ll use the Python SDK, boto3, in our examples. The SDKs for other languages work similarly.
To initialize the library, import it and declare a client.
import boto3
client = boto3.client('redshift-data')
Then, you can send a query request to your Redshift cluster, like so.
response = client.execute_statement(
    ClusterIdentifier='string',
    Database='string',
    DbUser='string',
    Parameters=[
        {
            'name': 'string',
            'value': 'string'
        },
    ],
    SecretArn='string',
    Sql='string',
    StatementName='string',
    WithEvent=True|False
)
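To make the skeleton concrete, here’s a minimal sketch that runs the earlier UNLOAD statement through the Data API. The cluster identifier, database name, and Secrets Manager ARN are placeholders you’d swap for your own values.

import boto3

client = boto3.client('redshift-data')

# The UNLOAD statement itself still does the export to S3
unload_sql = """
UNLOAD ('select * from students')
TO 's3://your-bucket/students/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
CSV;
"""

# Placeholder cluster, database, and secret ARN
response = client.execute_statement(
    ClusterIdentifier='mycluster-dev',
    Database='dev',
    SecretArn='arn:aws:secretsmanager:us-west-2:123456789012:secret:my-secret',
    Sql=unload_sql,
    StatementName='unload-students'
)

print(response['Id'])  # keep this UUID to check on the statement later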
Put your SQL query inside the Sql field (note that you authenticate with either DbUser, for temporary credentials, or SecretArn, for an AWS Secrets Manager secret). After it’s executed, it returns a response containing a UUID for your statement.
{
    'ClusterIdentifier': 'string',
    'CreatedAt': datetime(2022, 7, 1),
    'Database': 'string',
    'DbUser': 'string',
    'Id': 'string',  # UUID of your requested query
    'SecretArn': 'string'
}
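Because the Data API runs statements asynchronously, that Id is also what you’d pass to describe_statement if you want to wait for the statement to finish – a quick sketch:

import time

statement_id = response['Id']

# Poll until the statement reaches a terminal state
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
        break
    time.sleep(2)

print(desc['Status'])  # 'FINISHED' means an UNLOAD's files are in S3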
From the response, get the Id value and use the get_statement_result method to retrieve your query result.
response = client.get_statement_result(
    Id='string',
    NextToken='string'
)
If your data set is big, you will have to paginate using the NextToken value. For more details about the response, check the boto3 RedshiftDataAPIService document.
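If you do need to page through a large SELECT result, looping on NextToken is one way to do it. Here’s a rough sketch (keep in mind that an UNLOAD writes its output straight to S3, so there’s no result set to fetch for it):

# Rough sketch: collect every page of a large SELECT result
rows = []
next_token = None

while True:
    kwargs = {'Id': statement_id}
    if next_token:
        kwargs['NextToken'] = next_token
    page = client.get_statement_result(**kwargs)
    rows.extend(page['Records'])
    next_token = page.get('NextToken')
    if not next_token:
        break

print(f"Fetched {len(rows)} rows")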
AWS CLI
In addition to the SDK, you can use AWS CLI to interact with Redshift, but before using it, you’ll need to download and install the command line tool. For install instructions, check this AWS document. The Redshift command will look like the following.
aws redshift-data execute-statement \
    --region us-west-2 \
    --secret-arn arn:aws:secretsmanager:us-west-2:123456789012:secret:my-secret \
    --cluster-identifier mycluster-dev \
    --sql "select * from table_name limit 100" \
    --database dev
Notice that the command structure is similar to that of the SDK. Write your query in the --sql option; when the command executes, you’ll receive an Id as part of the response. Using that value, you can retrieve the query result.
aws redshift-data get-statement-result --id f3r3fdfsdf-sdf3-sdf3-113f-dsfjkwe8wr23
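The same pattern works for an UNLOAD from the command line. The sketch below reuses the bucket and role from earlier (still placeholders), and describe-statement tells you whether the export has finished.

aws redshift-data execute-statement \
    --region us-west-2 \
    --secret-arn arn:aws:secretsmanager:us-west-2:123456789012:secret:my-secret \
    --cluster-identifier mycluster-dev \
    --database dev \
    --sql "UNLOAD ('select * from students') TO 's3://your-bucket/students/' IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole' CSV;"

aws redshift-data describe-statement --id <your-statement-id>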
SQL client
If your data set is not particularly large (or if you just want an ad hoc way to export your data), downloading it from your SQL client directly is the simplest and easiest way.
Exporting data in CSV format is the most common feature in SQL client software, but you do have other formats available. After you execute your query, the result shows up in your client with an option to download it to your local storage. Since each client has a different menu, the best way to find the option is by referring to its user guide. 📖
Redshift UNLOAD parameters and options
Whether you use the AWS SDK, the CLI, or your SQL client, you’ll need to know what options you have at your disposal to export the data exactly as you want it. Check out these UNLOAD options so you can see everything you can achieve with the UNLOAD feature. The more you know, right?
- File types: You can choose CSV, JSON, or Parquet. If you plan to unload your data in order to import it into another piece of software, you’ll need to know what file type the recipient app is compatible with. If the target tool has an import feature, it’s likely to support CSV format.
- PARTITION BY ( column_name [, ... ] ) [INCLUDE]: If you want to split your exported data by specific columns (like a date field), add this option with the date column name. When you add the INCLUDE option, that column will be included in your export.
- BZIP2, GZIP, or ZSTD: By adding one of these options, your data will be exported into one of these compressed file types. Compression can make a big difference in file size.
- ADDQUOTES: This option guards against values being split incorrectly when a column contains your delimiter character. For example, if a column value is “apple, orange, banana”, that single column can be read as three columns because of the commas. By wrapping each value in double quotation marks, you prevent a malformed file.
- PARALLEL: By default, when you execute the UNLOAD query, it will generate multiple files. To create a single file, add PARALLEL FALSE (or PARALLEL OFF).
- DELIMITER AS 'delimiter_character': Sometimes a comma exists in a column value, which is fine unless it breaks the format of your CSV file. By default, the delimiter is a comma (,), but if you want to change it to a pipe (|) or a tab (\t), use this delimiter option – see the example after this list.
- Max file size: You can specify the maximum size of the files that are unloaded into your S3 bucket (between 5MB and 6.2GB). By default, the maximum size is 6.2GB.
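To show the delimiter and quoting options together, here’s a sketch that writes a single pipe-delimited file with every value wrapped in double quotes (same hypothetical bucket and role as before):

UNLOAD ('SELECT * FROM students')
TO 's3://your-bucket/students_pipe/'
IAM_ROLE 'arn:aws:iam::1234567890123:role/MyRedshiftRole'
DELIMITER AS '|'
ADDQUOTES
PARALLEL OFF;

If you ever COPY these files back into Redshift, pair ADDQUOTES here with REMOVEQUOTES on the COPY side.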
Is there a better way?
AWS offers three major ways to unload data from Redshift, plus you can use your SQL client to quickly download query results, but the UNLOAD command is just a vehicle. Is unloading your data to S3 really the best way to achieve your goal? 🤔
If you're pulling your data out of Redshift and into S3 just to move it to another system, consider Census rETL instead. Census has over 85 destination integrations for your Redshift data – from Algolia to Zendesk – helping you get your data where you need it (without the incurred frustration from the added step). With sync speeds over 4,000 records per second, detailed observability, and 50 fields for free, it’s easy to get started and grow to any scale.
Book a demo with a product specialist to see how Census can help you operationalize your data. 💪