We're excited to announce that Census AI Columns now supports Claude, bringing Anthropic's leading AI models to our data activation platform. With this addition, we now offer the complete Claude 3.5 series (Sonnet and Haiku) alongside GPT-4o and GPT-4o Mini.
Anthropic has built incredibly performant models in both Haiku and Sonnet, and we're excited to add them to our toolkit. But these new options also add another layer of complexity to using AI Columns: deciding which model to use.
It's a complicated question, so let's start by understanding the raw performance numbers.
Erich Hellstrom over at PromptLayer compiled an analysis of model performance across a series of categories. Here are the findings:
| Benchmark | Claude 3.5 Sonnet | Claude 3.5 Haiku | GPT-4o | GPT-4o Mini |
|---|---|---|---|---|
| Undergraduate Level Knowledge (MMLU) | 86.8% (5-shot) | 85.0% (5-shot) | 86.4% (5-shot) | 84.0% (5-shot) |
| Graduate Level Reasoning (GPQA, Diamond) | 50.4% (0-shot CoT) | 48.0% (0-shot CoT) | 35.7% (0-shot CoT) | 33.0% (0-shot CoT) |
| Grade School Math (GSM8K) | 95.0% (0-shot CoT) | 92.0% (0-shot CoT) | 92.0% (5-shot CoT) | 90.0% (5-shot CoT) |
| Math Problem-Solving (MATH) | 60.1% (0-shot CoT) | 58.0% (0-shot CoT) | 52.9% (4-shot) | 50.0% (4-shot) |
| Multilingual Math (MGSM) | 90.7% (0-shot) | 88.0% (0-shot) | 74.5% (8-shot) | 72.0% (8-shot) |
| Code (HumanEval) | 84.9% (0-shot) | 80.0% (0-shot) | 67.0% (0-shot) | 65.0% (0-shot) |
| Reasoning Over Text (DROP, F1 Score) | 83.1% (3-shot) | 80.0% (3-shot) | 80.9% (3-shot) | 78.0% (3-shot) |
| Mixed Evaluations (BIG-Bench-Hard) | 86.8% (3-shot CoT) | 84.0% (3-shot CoT) | 83.1% (3-shot CoT) | 80.0% (3-shot CoT) |
| Knowledge Q&A (ARC-Challenge) | 96.4% (25-shot) | 94.0% (25-shot) | 96.3% (25-shot) | 94.0% (25-shot) |
| Common Knowledge (HellaSwag) | 95.4% (10-shot) | 93.0% (10-shot) | 95.3% (10-shot) | |
An overview of performance numbers
The table above shows each model's performance across a series of benchmark tests. And while we won't go deep into how these numbers are evaluated in this piece, I do want to call out a few of the big takeaways:
- All of the models are smart: Seriously, they're all incredibly impressive. For most lightweight tasks like content writing and basic reasoning, any of them can get the job done.
- None of them are brilliant at complex reasoning: Look at the graduate-level reasoning (GPQA) and math (MATH) rows. All of these models will need some guidance when it comes to solving complex math and reasoning problems. The good news is that you can do some of this legwork with prompt writing.
- The biggest spread is in coding and multilingual math: Look at the HumanEval and MGSM rows. These are the areas where picking the wrong model costs you the most performance.
But numbers can only tell us so much. So let's unpack this a little bit.
Unpacking AI Model Performance
The Standout: Claude 3.5 Sonnet
Claude 3.5 Sonnet emerges as the heavyweight champion in several key areas. Most notably, it achieved an impressive 84.9% score on the HumanEval coding benchmark, significantly outperforming GPT-4o's 67.0%. This isn't just a statistical difference; it translates to real-world advantages in tasks like the following (sketched in code after the list):
- Automating complex data transformations
- Building sophisticated scoring algorithms
- Generating production-ready code for data pipelines
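To make that concrete, here's a minimal sketch of asking Sonnet to write a transformation function via the Anthropic Python SDK. This isn't how AI Columns works under the hood; the model ID, prompt, and function name are illustrative assumptions:

```python
# A minimal sketch (not Census's internal implementation): asking Claude 3.5
# Sonnet to generate a data transformation via the Anthropic Python SDK.
# The model ID and prompt are illustrative; check Anthropic's docs for
# current model names.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = (
    "Write a Python function `normalize_phone(raw: str)` that converts messy "
    "US phone numbers like '(415) 555-0100 ext. 2' to E.164 format "
    "('+14155550100'), returning None when no valid number is found."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

print(response.content[0].text)  # the generated function, ready for review
```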
Sonnet also shows remarkable strength in graduate-level reasoning tasks, scoring 50.4% on the GPQA Diamond benchmark compared to GPT-4o's 35.7%. This makes it particularly valuable for complex analysis tasks like:
- Advanced customer segmentation
- Sophisticated market analysis
- Complex feature engineering
The Runner Up: Claude 3.5 Haiku
If Sonnet is the heavyweight, Haiku is the lightweight boxing champion: fast, efficient, and surprisingly powerful. While it trails Sonnet slightly in absolute performance (80.0% vs 84.9% on HumanEval), it offers significantly faster processing times and lower costs.
The Versatile Performer: GPT-4o
GPT-4o shines in its consistency and well-rounded performance. It particularly excels in knowledge-based tasks, scoring neck-and-neck with Claude Sonnet on benchmarks like ARC-Challenge (96.3% vs 96.4%) and HellaSwag (95.3% vs 95.4%). This makes it an excellent choice for:
- Content generation and analysis
- Customer interaction automation
- General-purpose data enrichment
The Budget Champion: GPT-4o Mini
Don't let the "Mini" fool you: GPT-4o Mini delivers impressive performance at a fraction of the cost. With pricing at just $0.15 per million input tokens (compared to Sonnet's $3.00), it's an excellent choice for high-volume, straightforward tasks.
The best LLM for you also depends on cost
Performance isn't everything, especially when you consider what that performance costs. You need a right-sized tool for the task at hand if you want to scale up your model usage without going broke. So let's talk money. Here's how the pricing breaks down:

| Model | Input Cost (per million tokens) | Output Cost (per million tokens) |
|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $1.00 | $5.00 |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o Mini | $0.15 | $0.60 |

As you can see, the most performant models are also the priciest. Pricing evolves, so treat these as estimates based on public information from 2024.
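To make the math concrete, here's a quick back-of-the-envelope cost estimator using the 2024 prices above. The per-row token counts are assumptions you'd replace with measurements from your own prompts:

```python
# Back-of-the-envelope cost comparison using the 2024 prices above.
# Token counts per row are illustrative assumptions; measure your own.
PRICES = {  # (input $, output $) per million tokens
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3.5-haiku": (1.00, 5.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def job_cost(model: str, rows: int, in_tokens: int = 500, out_tokens: int = 100) -> float:
    """Estimated cost of enriching `rows` records with one call per row."""
    in_price, out_price = PRICES[model]
    return rows * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

for model in PRICES:
    print(f"{model}: ${job_cost(model, rows=1_000_000):,.2f} per million rows")

# Under these assumptions, GPT-4o Mini runs about $135 per million rows
# versus about $3,000 for Claude 3.5 Sonnet.
```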
Balancing cost and performance: When to use each LLM?
Choose Claude 3.5 Sonnet when:
- You're working with complex code generation or data transformations
- You need the highest possible accuracy for customer-facing applications
- You're dealing with sophisticated analysis tasks
- Cost is less important than performance
Choose Claude 3.5 Haiku when:
- You need a balance of speed and accuracy
- You're processing high volumes of data
- You want strong coding capabilities at a lower price point
- Real-time processing is important
Choose GPT-4o when:
- You need well-rounded performance across various tasks
- You're focusing on content generation and analysis
- You want a balance of capability and cost
- Consistency across different types of tasks is important
Choose GPT-4o Mini when:
- You're processing very high volumes of data
- You're working with straightforward, repetitive tasks
- Cost efficiency is a primary concern
- Real-time processing at scale is needed
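If you want to encode these rules of thumb in a pipeline, a small routing function is one way to do it. This is a sketch assuming three coarse task traits; the names, thresholds, and ordering of checks are our assumptions to adapt, not an official recommendation:

```python
# A sketch of the decision lists above as a routing function.
# Trait names and the order of checks are illustrative assumptions.
def pick_model(complex_reasoning: bool, high_volume: bool, realtime: bool) -> str:
    if complex_reasoning and not high_volume:
        return "claude-3.5-sonnet"   # accuracy matters more than cost
    if realtime and complex_reasoning:
        return "claude-3.5-haiku"    # speed with strong accuracy
    if high_volume and not complex_reasoning:
        return "gpt-4o-mini"         # cheapest for simple, repetitive work
    return "gpt-4o"                  # well-rounded default

assert pick_model(complex_reasoning=True, high_volume=False, realtime=False) == "claude-3.5-sonnet"
assert pick_model(complex_reasoning=False, high_volume=True, realtime=True) == "gpt-4o-mini"
```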
Real-World Applications: The best models for specific use cases
We're already seeing customers use these models in creative ways:
E-commerce Product Categorization
- GPT-4o Mini for initial bulk categorization: Perfect for processing thousands of products daily at minimal cost ($0.15/1M tokens). While it may not catch every nuance, it's extremely cost-effective for basic categorization tasks where 90%+ accuracy is acceptable.
- Claude Sonnet for complex edge cases and taxonomy development: With its superior performance on graduate-level reasoning (50.4% vs GPT-4o's 35.7%), Sonnet excels at handling ambiguous cases and developing sophisticated classification systems. Worth the higher cost ($3.00/1M tokens) for these critical decisions that shape your entire catalog structure. (This two-tier pattern is sketched below.)
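Here's a minimal sketch of that two-tier pattern, assuming a hypothetical `call_llm` wrapper around your SDK of choice and a confidence threshold you'd tune for your catalog:

```python
# Sketch of tiered categorization: cheap first pass, escalate low-confidence
# items to a stronger model. `call_llm` is a hypothetical wrapper around
# whichever SDK you use; the 0.8 threshold is an assumption to tune.
from dataclasses import dataclass

@dataclass
class Categorization:
    category: str
    confidence: float  # model-reported or heuristic, between 0 and 1

def call_llm(model: str, product_description: str) -> Categorization:
    """Hypothetical wrapper: prompt `model` to categorize and self-score."""
    raise NotImplementedError  # wire up your OpenAI/Anthropic client here

def categorize(product_description: str) -> Categorization:
    first_pass = call_llm("gpt-4o-mini", product_description)
    if first_pass.confidence >= 0.8:
        return first_pass  # cheap model was confident; stop here
    # Edge case: escalate to Sonnet for the harder call.
    return call_llm("claude-3.5-sonnet", product_description)
```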
Customer Support Automation
- Haiku for real-time ticket routing and initial response generation: Combines speed with strong performance (80% on reasoning tasks) at a moderate cost ($1.00/1M tokens). This makes it ideal for real-time operations where you need quick, accurate responses without breaking the bank.
- GPT-4o for detailed response drafting: Shows excellent performance on knowledge-based tasks (96.3% on ARC-Challenge) at a middle-tier price point ($2.50/1M tokens). Perfect for generating comprehensive, accurate responses that require deep context understanding.
Sales Operations
- Sonnet for complex lead scoring and account prioritization: Its superior performance on complex reasoning tasks makes it worth the premium for high-stakes decisions that directly impact revenue. The difference between 85% and 95% accuracy could mean millions in properly prioritized opportunities.
- GPT-4o Mini for high-volume data enrichment: At just $0.15/1M tokens, it's the most cost-effective option for basic data enrichment tasks like standardizing company names or extracting basic firmographic data. When you're processing millions of records monthly, this cost efficiency is crucial.
Getting started with LLMs
Ready to experiment with these models in your data workflows? Here's how to begin:
- Start with a lower-cost model (GPT-4o Mini or Haiku) for initial testing
- Identify your most critical use cases where accuracy matters most
- A/B test different models on your specific data (a minimal sketch follows this list)
- Monitor both performance and costs to optimize your usage
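For the A/B testing step, the simplest version is running the same sample rows through two candidate models and comparing outputs side by side. This sketch assumes a hypothetical `run_model` wrapper around your LLM client:

```python
# Minimal A/B test sketch: run the same sample through two candidate models
# and write the outputs side by side for review. `run_model` is a
# hypothetical wrapper around your LLM client of choice.
import csv

def run_model(model: str, row: dict) -> str:
    """Hypothetical: send one row's enrichment prompt to `model`."""
    raise NotImplementedError

def ab_test(rows: list[dict], model_a: str, model_b: str, out_path: str) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", model_a, model_b])
        for row in rows:
            writer.writerow([row, run_model(model_a, row), run_model(model_b, row)])

# Review the CSV by hand (or score it) before committing to a model:
# ab_test(sample_rows, "gpt-4o-mini", "claude-3.5-haiku", "ab_results.csv")
```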
Remember, you're not locked into a single model - many of our customers use different models for different tasks, optimizing for both performance and cost.