Using LLMs to auto-categorize your data

Sean Lynch
7 April 2025

One of the most common use cases for Census’s AI Columns is taking data and applying a standard set of categories (or labels, or tags).

A category is a simple concept, but it gives you a powerful tool for making large, complex, and evolving datasets easier to understand and navigate, and for spotting emerging trends quickly. Categories let you summarize data that is long and messy, such as user feedback, or data with many facets to weigh, such as the firmographic data used to estimate the potential value of a new lead.

Reliably applying categories to messy data can take a lot of work, but that’s where LLMs such as OpenAI’s GPT models, Claude, and Gemini can be a huge help. Census makes it easy to apply these techniques at scale with AI Columns.

AI Columns are a powerful way to use AI to quickly improve and extend your datasets with Census, but the techniques below can be applied manually as well. As with many LLM-based workflows, it’s the structure and context in your prompts that do the heavy lifting.

Defining your categories

You’ll need to know the set of categories you want to apply to your data. If you’re already doing this categorization manually, you’ll have a good starting point. A few things to consider:

  • Do you want to apply a single category, or do you expect multiple categories to apply?
  • It may be helpful to have an example of data that should fall into your category.
  • How do you want to handle situations where none of your categories apply? Providing an Other/Unknown option could be helpful, but could also reduce the value of your categories.

If you don’t yet have a set of categories, you can also use AI to suggest a potential set for you! See below for one approach. One thing to keep in mind: it can be tempting to ask an LLM to come up with a new category on the fly if none applies. This can result in noisy data, so it’s worth having a defined set of categories to start.

Writing the prompt

The prompt is at the heart of your AI Column. It will need to do a few things:

  • Include template variables for all the column values you want to consider when categorizing.
  • List the categories you’d like the AI Column to use. Optionally, you may want to experiment with adding a good example of a record that fits each category.
  • Give any instructions on how to use the categories. For example, you can ask for a single category or, if you’ll accept multiple, say how to format them (e.g. as a comma-delimited list). You can also say what to do when no category applies.
  • Ask for consistent output. Prompt engineering is a bit of an art at this point, but including things like “Please only include the exact category name” as part of your prompt can improve the consistency of the response.
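
If you’re assembling the prompt by hand, a minimal Python sketch of the ingredients above might look like this. The Category class, the build_prompt helper, the example categories, and the FEEDBACK_TEXT column are all hypothetical. The record['...'] placeholder mirrors the template variable shown later in this article, so adjust it to whatever your tool expects.

```python
from dataclasses import dataclass


@dataclass
class Category:
    """One allowed category (hypothetical structure, for illustration only)."""
    name: str
    description: str
    example: str = ""  # optional record that clearly belongs in this category


def build_prompt(categories, record_fields, allow_multiple=False, fallback="Other"):
    """Assemble a categorization prompt from categories and column placeholders."""
    lines = ["Categorize the record below using only these categories:"]
    for cat in categories:
        line = f"- {cat.name}: {cat.description}"
        if cat.example:
            line += f' (example: "{cat.example}")'
        lines.append(line)
    lines.append("")
    # Instructions on how to use the categories
    if allow_multiple:
        lines.append("Return every category that applies as a comma-delimited list.")
    else:
        lines.append("Return exactly one category.")
    if fallback:
        lines.append(f"If no category applies, return {fallback}.")
    lines.append("Please only include the exact category name.")
    lines.append("")
    lines.append("Record:")
    for field in record_fields:
        # Assumed placeholder syntax; replace with your tool's template variables
        lines.append(f"{field}: record['{field}']")
    return "\n".join(lines)


categories = [
    Category("Bug", "Something is broken", "The app crashes when I log in"),
    Category("Pricing", "Questions or complaints about cost"),
    Category("Other", "Anything that fits none of the above"),
]
prompt = build_prompt(categories, ["FEEDBACK_TEXT"])
print(prompt)
```

Keeping the categories in one list means the prompt and any later validation can share the same source of truth.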

If you’re using a single category for each row, you can use the Enum datatype in Advanced Options and provide the specific options you expect. Census can then enforce that only those exact values are returned, making your data much easier to consume afterward.
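
If you’re applying these techniques manually, outside Census, you can get a similar guarantee by checking each response against a fixed list yourself. The sketch below is one way to do that in Python and is not a description of how Census implements the Enum option. The FeedbackCategory values and both helper functions are hypothetical, and the matching is deliberately a little lenient (it ignores case, surrounding quotes, and trailing periods).

```python
from enum import Enum


class FeedbackCategory(Enum):
    # Hypothetical categories, for illustration only
    BUG = "Bug"
    FEATURE_REQUEST = "Feature Request"
    PRICING = "Pricing"
    OTHER = "Other"


def parse_category(raw):
    """Return the matching enum member, or None if the answer is off-list."""
    # Trim whitespace, then surrounding quotes and periods, then ignore case
    cleaned = raw.strip().strip(".\"'").casefold()
    for member in FeedbackCategory:
        if member.value.casefold() == cleaned:
            return member
    return None


def parse_categories(raw):
    """Split a comma-delimited answer into (valid members, rejected strings)."""
    valid, rejected = [], []
    for part in raw.split(","):
        member = parse_category(part)
        if member is not None:
            if member not in valid:
                valid.append(member)
        elif part.strip():
            rejected.append(part.strip())
    return valid, rejected


print(parse_category(' "bug". '))
print(parse_category("Billing"))
print(parse_categories("Bug, pricing, Shipping"))
```

Anything that comes back as None or lands in the rejected list is a candidate for a prompt tweak or a new category.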

Once you’ve written your prompt, you can use preview to see how the AI would categorize some of your records. Now is a great time to tune the prompt or add more examples if the categorization doesn’t seem accurate.

Understanding the results

Once you’ve saved and run your AI Column, the results will be available in your dataset and you can immediately start syncing them. But usually it’s helpful to do a little quality control first.

In this case, we’ll create a new Basic Dataset with SQL so we can look at the results of the AI Column.

  • If you’re using a Census Store Dataset, you can query it directly by referencing datasets.resource-id-of-your-dataset
  • Otherwise, if you’re using your own warehouse, you’ll need to find the result of the AI Column separately. This involves looking at the sync powering the AI Column and the destination table name. AI Column results are stored as individual tables in the CENSUS schema and will need to be joined with your original dataset manually.

Let’s do a quick analysis of the results we’ve got from categorization:

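-- Count how many records were assigned to each category.
-- GROUP BY ALL is not supported in every warehouse; if yours lacks it,
-- group by the result column explicitly.
SELECT census_third_party_result_column AS category, COUNT(*) AS record_count
FROM CENSUS.dataset_column_chatgpt_a0f7727d_1234_5678_9012_6110cc9b736a
GROUP BY ALL
ORDER BY record_count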

This will show the count of records assigned to each of your categories. You can use this high-level analysis to make sure your categories have been assigned appropriately. If any “weird” answers have made it through, you’ll see them here as well. Note that records that weren’t categorized because of an API issue will not appear here. You’ll need to join these results against your original dataset to see which rows weren’t categorized.
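
To make both checks concrete, here is a small, self-contained sketch that uses Python’s built-in sqlite3 module as a stand-in for your warehouse. The table contents, the ai_col table name, and the expected category list are made up for illustration. The two census_third_party_* column names are the ones used in this article’s queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for your dataset and the AI Column results table
    CREATE TABLE original_dataset (id INTEGER PRIMARY KEY, feedback TEXT);
    CREATE TABLE ai_col (
        census_third_party_unique_id_column INTEGER,
        census_third_party_result_column TEXT
    );
    INSERT INTO original_dataset VALUES
        (1, 'App crashes'), (2, 'Too pricey'), (3, 'Love it'), (4, 'Invoice is wrong');
    INSERT INTO ai_col VALUES (1, 'Bug'), (2, 'Pricing'), (4, 'Billing-ish');
""")

# Rows in the original dataset with no matching AI Column result
uncategorized = conn.execute("""
    SELECT original_dataset.id
    FROM original_dataset
    LEFT JOIN ai_col
      ON ai_col.census_third_party_unique_id_column = original_dataset.id
    WHERE ai_col.census_third_party_unique_id_column IS NULL
    ORDER BY original_dataset.id
""").fetchall()

# Answers that are not in the category list you asked for
off_list = conn.execute("""
    SELECT census_third_party_result_column, COUNT(*)
    FROM ai_col
    WHERE census_third_party_result_column NOT IN ('Bug', 'Pricing', 'Other')
    GROUP BY 1
    ORDER BY 1
""").fetchall()

print(uncategorized)  # [(3,)]
print(off_list)       # [('Billing-ish', 1)]
```

The same two queries should carry over to a warehouse with only the table names swapped.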

You can also spot check the records that made it into individual categories.

WITH original_dataset AS (
  [The SQL for your original dataset goes here]
)

-- Pull a small sample of the records assigned to one category
SELECT
  ai_col.census_third_party_result_column AS category,
  original_dataset.*
FROM CENSUS.dataset_column_chatgpt_a0f7727d_1234_5678_9012_6110cc9b736a ai_col
LEFT JOIN original_dataset
  ON ai_col.census_third_party_unique_id_column = original_dataset.id
WHERE ai_col.census_third_party_result_column = 'Category You Want To Inspect'
LIMIT 10

Once you’ve inspected the results, you can use your categories in Segments, Syncs, or other workbench transformations, including Entity Resolution. And categorization is just one of the many things you can do with AI Columns! Take a look at our blog for other powerful ways to use AI to improve your data in Census.

Advanced Mode: Using AI to define your categories

If you have data you’re not sure how to categorize, you can also leverage AI to generate the categories you want to use. This example will use a second temporary AI Column to generate the list of categories, but you can also use these steps outside Census directly with your preferred LLM.

Step 1: Aggregate the data

With the dataset you intend to categorize, you’re first going to need to generate a single block of text with all the data you want to categorize. You’ll want to use SQL aggregate functions to return a single row with the full text (as the text will then be passed to the AI Column).

A few tricks to keep in mind:

  • If you’re using a single value to categorize, you can use the LISTAGG function to merge the values across multiple rows into a single list.
  • If you’re using multiple values, you’ll need to merge the values within each row using CONCAT to create a set of values.
  • In either case, aim for output that is human readable. Use newlines and dashes as bullets to format the data into a list or a list of groups.
  • Your data might be too large to fit in the context window in the next step. You can use ORDER BY RANDOM() LIMIT 1000 to produce a random sample of your dataset, or simply use LIMIT on its own.

⚠️ Note that if you’re using Census AI Credits, your token window will be much smaller, potentially too small to apply this technique. We recommend you provide your own API key in order to avoid hitting this limit.
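
As a rough illustration of the aggregation step, the sketch below runs the same idea against an in-memory SQLite database from Python. SQLite has no LISTAGG, so its group_concat function stands in for it here, and the || operator stands in for CONCAT. The feedback table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feedback (id INTEGER PRIMARY KEY, plan TEXT, comment TEXT)")
conn.executemany(
    "INSERT INTO feedback (plan, comment) VALUES (?, ?)",
    [
        ("Free", "App crashes on login"),
        ("Pro", "Pricing is confusing"),
        ("Pro", "Please add dark mode"),
    ],
)

# Single value per row: sample the rows, then merge them into one dashed list.
# char(10) is a newline, so each record lands on its own line.
single = conn.execute("""
    SELECT group_concat('- ' || comment, char(10)) AS summarized_data
    FROM (SELECT comment FROM feedback ORDER BY RANDOM() LIMIT 1000)
""").fetchone()[0]

# Multiple values per row: label and join the columns first, then aggregate.
multi = conn.execute("""
    SELECT group_concat('- plan: ' || plan || ', comment: ' || comment, char(10))
           AS summarized_data
    FROM (SELECT plan, comment FROM feedback ORDER BY RANDOM() LIMIT 1000)
""").fetchone()[0]

print(single)
print(multi)
print(len(multi), "characters")  # a quick sense of how much text you are sending
```

Because the sample is random, the order of the lines changes from run to run. In your own warehouse you would swap group_concat for LISTAGG (or its equivalent) and keep the same shape: sample in a subquery, then aggregate to a single row.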

Step 2: Temporary AI Column to generate categories

With the data merged into a single row, you can now use AI Columns to generate categories. The important part here is that you’re giving all of your data (or at least a sample of it) to the LLM in one shot, as opposed to on a per-row basis.

Your prompt should look something like this:

Use the sample of record data below to recommend a set of categories.
- Your set of categories should cover the majority of examples with exactly one category.
- Your categories should be meaningfully distinct from each other.
- These categories should be useful for [YOUR BUSINESS PROBLEM].
- Only if necessary, you may include an unknown or other option.

Here is the example data: record['SUMMARIZED_DATA']

The more instructions or hints you can include, the better categories you’ll receive for your business.

Pressing preview will generate your set of categories for your data. Once you have them, you don’t need to save your AI Column. Just copy and paste your categories into your prompt and continue with your actual AI Column.