We're excited to announce that Census AI Columns now supports Claude, bringing Anthropic's leading AI models to our data activation platform. With this addition, we now offer the complete Claude 3.5 series (Sonnet and Haiku) alongside GPT-4o and GPT-4o Mini.
Anthropic has built incredibly performant models in both Haiku and Sonnet, and we're excited to add them to our toolkit. But these new options also add another layer of complexity to using AI Columns: deciding which model to use.
It's a complicated question, so let's start by understanding the raw performance numbers.
Erich Hellstrom over at PromptLayer compiled an analysis of model performance across a series of categories. Here are the findings:
| Benchmark | Claude 3.5 Sonnet | Claude 3.5 Haiku | GPT-4o | GPT-4o Mini |
|---|---|---|---|---|
| Undergraduate Level Knowledge (MMLU) | 86.8% (5-shot) | 85.0% (5-shot) | 86.4% (5-shot) | 84.0% (5-shot) |
| Graduate Level Reasoning (GPQA, Diamond) | 50.4% (0-shot CoT) | 48.0% (0-shot CoT) | 35.7% (0-shot CoT) | 33.0% (0-shot CoT) |
| Grade School Math (GSM8K) | 95.0% (0-shot CoT) | 92.0% (0-shot CoT) | 92.0% (5-shot CoT) | 90.0% (5-shot CoT) |
| Math Problem-Solving (MATH) | 60.1% (0-shot CoT) | 58.0% (0-shot CoT) | 52.9% (4-shot) | 50.0% (4-shot) |
| Multilingual Math (MGSM) | 90.7% (0-shot) | 88.0% (0-shot) | 74.5% (8-shot) | 72.0% (8-shot) |
| Code (HumanEval) | 84.9% (0-shot) | 80.0% (0-shot) | 67.0% (0-shot) | 65.0% (0-shot) |
| Reasoning Over Text (DROP, F1 Score) | 83.1% (3-shot) | 80.0% (3-shot) | 80.9% (3-shot) | 78.0% (3-shot) |
| Mixed Evaluations (BIG-Bench-Hard) | 86.8% (3-shot CoT) | 84.0% (3-shot CoT) | 83.1% (3-shot CoT) | 80.0% (3-shot CoT) |
| Knowledge Q&A (ARC-Challenge) | 96.4% (25-shot) | 94.0% (25-shot) | 96.3% (25-shot) | 94.0% (25-shot) |
| Common Knowledge (HellaSwag) | 95.4% (10-shot) | 93.0% (10-shot) | 95.3% (10-shot) | |
An overview of performance numbers
The table above shows each model's performance across a series of benchmark tests. And while we won't go deep into how these numbers are evaluated in this piece, I do want to call out a few of the big takeaways:
- All of the models are smart: Seriously, they're all incredibly impressive. For most lightweight tasks like content writing and basic reasoning, any of them can get the job done.
- None of them are brilliant at complex reasoning: Look at the graduate-level reasoning (GPQA) and math (MATH) rows. All of these models will need some guidance when it comes to solving complex math and reasoning problems. The good news is that you can do some of this legwork with prompt writing.
- The biggest spread is in coding and multilingual math: Look at the HumanEval and MGSM rows. These are the areas where picking the wrong model costs you the most performance.
But numbers can only tell us so much. So let's unpack this a little bit.
Unpacking AI Model Performance
The Standout: Claude 3.5 Sonnet
Claude 3.5 Sonnet emerges as the heavyweight champion in several key areas. Most notably, it achieved an impressive 84.9% score on the HumanEval coding benchmark, significantly outperforming GPT-4o's 67.0%. This isn't just a statistical difference; it translates to real-world advantages in tasks like the following (sketched in code after the list):
- Automating complex data transformations
- Building sophisticated scoring algorithms
- Generating production-ready code for data pipelines
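To make that concrete, here's a minimal sketch of asking Sonnet to write a transformation function via the Anthropic Python SDK. This isn't how AI Columns works under the hood; the model ID, prompt, and function name are illustrative assumptions:

```python
# A minimal sketch (not Census's internal implementation): asking Claude 3.5
# Sonnet to generate a data transformation via the Anthropic Python SDK.
# The model ID and prompt are illustrative; check Anthropic's docs for
# current model names.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = (
    "Write a Python function `normalize_phone(raw: str)` that converts messy "
    "US phone numbers like '(415) 555-0100 ext. 2' to E.164 format "
    "('+14155550100'), returning None when no valid number is found."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

print(response.content[0].text)  # the generated function, ready for review
```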
Sonnet also shows remarkable strength in graduate-level reasoning tasks, scoring 50.4% on the GPQA Diamond benchmark compared to GPT-4o's 35.7%. This makes it particularly valuable for complex analysis tasks like:
- Advanced customer segmentation
- Sophisticated market analysis
- Complex feature engineering
The Runner Up: Claude 3.5 Haiku
If Sonnet is the heavyweight, Haiku is the lightweight boxing champion: fast, efficient, and surprisingly powerful. While it trails Sonnet slightly in absolute performance (80.0% vs 84.9% on HumanEval), it offers significantly faster processing times and lower costs.
The Versatile Performer: GPT-4o
GPT-4o shines in its consistency and well-rounded performance. It particularly excels in knowledge-based tasks, scoring neck-and-neck with Claude Sonnet on benchmarks like ARC-Challenge (96.3% vs 96.4%) and HellaSwag (95.3% vs 95.4%). This makes it an excellent choice for:
- Content generation and analysis
- Customer interaction automation
- General-purpose data enrichment
The Budget Champion: GPT-4o Mini
Don't let the "Mini" fool you: GPT-4o Mini delivers impressive performance at a fraction of the cost. With pricing at just $0.15 per million input tokens (compared to Sonnet's $3.00), it's an excellent choice for high-volume, straightforward tasks.
The best LLM for you also depends on cost
Performance isn't everything, especially when you consider what that performance costs. You need a right-sized tool for the task at hand if you want to scale up your model usage without going broke. So let's talk money. Here's how the pricing breaks down:

| Model | Input Cost (per million tokens) | Output Cost (per million tokens) |
|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $1.00 | $5.00 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o Mini | $0.15 | $0.60 |

As you can see, the most performant models are also the priciest. Pricing evolves, so treat these as estimates based on public information from 2024.
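To make the math concrete, here's a quick back-of-the-envelope cost estimator using the 2024 prices above. The per-row token counts are assumptions you'd replace with measurements from your own prompts:

```python
# Back-of-the-envelope cost comparison using the 2024 prices above.
# Token counts per row are illustrative assumptions; measure your own.
PRICES = {  # (input $, output $) per million tokens
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3.5-haiku": (1.00, 5.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def job_cost(model: str, rows: int, in_tokens: int = 500, out_tokens: int = 100) -> float:
    """Estimated cost of enriching `rows` records with one call per row."""
    in_price, out_price = PRICES[model]
    return rows * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

for model in PRICES:
    print(f"{model}: ${job_cost(model, rows=1_000_000):,.2f} per million rows")

# Under these assumptions, GPT-4o Mini runs about $135 per million rows
# versus about $3,000 for Claude 3.5 Sonnet.
```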
Balancing cost and performance: When to use each LLM?
Choose Claude 3.5 Sonnet when:
- You're working with complex code generation or data transformations
- You need the highest possible accuracy for customer-facing applications
- You're dealing with sophisticated analysis tasks
- Cost is less important than performance
Choose Claude 3.5 Haiku when:
- You need a balance of speed and accuracy
- You're processing high volumes of data
- You want strong coding capabilities at a lower price point
- Real-time processing is important
Choose GPT-4o when:
- You need well-rounded performance across various tasks
- You're focusing on content generation and analysis
- You want a balance of capability and cost
- Consistency across different types of tasks is important
Choose GPT-4o Mini when:
- You're processing very high volumes of data
- You're working with straightforward, repetitive tasks
- Cost efficiency is a primary concern
- Real-time processing at scale is needed
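If you want to encode these rules of thumb in a pipeline, a small routing function is one way to do it. This is a sketch assuming three coarse task traits; the names, thresholds, and ordering of checks are our assumptions to adapt, not an official recommendation:

```python
# A sketch of the decision lists above as a routing function.
# Trait names and the order of checks are illustrative assumptions.
def pick_model(complex_reasoning: bool, high_volume: bool, realtime: bool) -> str:
    if complex_reasoning and not high_volume:
        return "claude-3.5-sonnet"   # accuracy matters more than cost
    if realtime and complex_reasoning:
        return "claude-3.5-haiku"    # speed with strong accuracy
    if high_volume and not complex_reasoning:
        return "gpt-4o-mini"         # cheapest for simple, repetitive work
    return "gpt-4o"                  # well-rounded default

assert pick_model(complex_reasoning=True, high_volume=False, realtime=False) == "claude-3.5-sonnet"
assert pick_model(complex_reasoning=False, high_volume=True, realtime=True) == "gpt-4o-mini"
```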
Real-World Applications: The best models for specific use cases
We're already seeing customers use these models in creative ways:
E-commerce Product Categorization
- GPT-4o Mini for initial bulk categorization: Perfect for processing thousands of products daily at minimal cost ($0.15/1M tokens). While it may not catch every nuance, it's extremely cost-effective for basic categorization tasks where 90%+ accuracy is acceptable.
- Claude Sonnet for complex edge cases and taxonomy development: With its superior performance on graduate-level reasoning (50.4% vs GPT-4o's 35.7%), Sonnet excels at handling ambiguous cases and developing sophisticated classification systems. Worth the higher cost ($3.00/1M tokens) for these critical decisions that shape your entire catalog structure. (This two-tier pattern is sketched below.)
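Here's a minimal sketch of that two-tier pattern, assuming a hypothetical `call_llm` wrapper around your SDK of choice and a confidence threshold you'd tune for your catalog:

```python
# Sketch of tiered categorization: cheap first pass, escalate low-confidence
# items to a stronger model. `call_llm` is a hypothetical wrapper around
# whichever SDK you use; the 0.8 threshold is an assumption to tune.
from dataclasses import dataclass

@dataclass
class Categorization:
    category: str
    confidence: float  # model-reported or heuristic, between 0 and 1

def call_llm(model: str, product_description: str) -> Categorization:
    """Hypothetical wrapper: prompt `model` to categorize and self-score."""
    raise NotImplementedError  # wire up your OpenAI/Anthropic client here

def categorize(product_description: str) -> Categorization:
    first_pass = call_llm("gpt-4o-mini", product_description)
    if first_pass.confidence >= 0.8:
        return first_pass  # cheap model was confident; stop here
    # Edge case: escalate to Sonnet for the harder call.
    return call_llm("claude-3.5-sonnet", product_description)
```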
Customer Support Automation
- Haiku for real-time ticket routing and initial response generation: Combines speed with strong performance (80% on reasoning tasks) at a moderate cost ($1.00/1M tokens). This makes it ideal for real-time operations where you need quick, accurate responses without breaking the bank.
- GPT-4o for detailed response drafting: Shows excellent performance on knowledge-based tasks (96.3% on ARC-Challenge) at a middle-tier price point ($2.50/1M tokens). Perfect for generating comprehensive, accurate responses that require deep context understanding.
Sales Operations
- Sonnet for complex lead scoring and account prioritization: Its superior performance on complex reasoning tasks makes it worth the premium for high-stakes decisions that directly impact revenue. The difference between 85% and 95% accuracy could mean millions in properly prioritized opportunities.
- GPT-4o Mini for high-volume data enrichment: At just $0.15/1M tokens, it's the most cost-effective option for basic data enrichment tasks like standardizing company names or extracting basic firmographic data. When you're processing millions of records monthly, this cost efficiency is crucial.
Getting started with LLMs
Ready to experiment with these models in your data workflows? Here's how to begin:
- Start with a lower-cost model (GPT-4o Mini or Haiku) for initial testing
- Identify your most critical use cases where accuracy matters most
- A/B test different models on your specific data (a minimal sketch follows this list)
- Monitor both performance and costs to optimize your usage
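For the A/B testing step, the simplest version is running the same sample rows through two candidate models and comparing outputs side by side. This sketch assumes a hypothetical `run_model` wrapper around your LLM client:

```python
# Minimal A/B test sketch: run the same sample through two candidate models
# and write the outputs side by side for review. `run_model` is a
# hypothetical wrapper around your LLM client of choice.
import csv

def run_model(model: str, row: dict) -> str:
    """Hypothetical: send one row's enrichment prompt to `model`."""
    raise NotImplementedError

def ab_test(rows: list[dict], model_a: str, model_b: str, out_path: str) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", model_a, model_b])
        for row in rows:
            writer.writerow([row, run_model(model_a, row), run_model(model_b, row)])

# Review the CSV by hand (or score it) before committing to a model:
# ab_test(sample_rows, "gpt-4o-mini", "claude-3.5-haiku", "ab_results.csv")
```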
Remember, you're not locked into a single model - many of our customers use different models for different tasks, optimizing for both performance and cost.