It's official: Gemini Flash and Pro are now available within Census AI columns! But with more models come more decisions. Here's our breakdown of how much Gemini costs, how it performs, and the use cases we like each model for.
Costs: How does Gemini pricing compare to Claude and GPT?
Before we get into performance, let's take a moment to understand costs. Here's how Gemini compares to the other large-scale LLMs:
| Model | Input Cost (per million tokens) | Output Cost (per million tokens) |
| --- | --- | --- |
| Claude 3.5 Sonnet | $3 | $15 |
| Claude 3.5 Haiku | $1 | $5 |
| GPT-4o | $2.50 | $10 |
| GPT-4o mini | $0.15 | $0.60 |
| Gemini 1.5 Flash | $0.075 (prompts up to 128k tokens) | $0.30 (prompts up to 128k tokens) |
| Gemini 1.5 Pro | $1.25 (prompts up to 128k tokens), $2.50 (prompts longer than 128k) | $2.50 (prompts up to 128k tokens), $10 (prompts longer than 128k) |
Breaking down costs
Gemini is the only model here whose pricing changes with the length of the prompt. According to Google's estimates, 100k tokens works out to roughly 80,000 words (this isn't exact: a token is roughly 4 characters, including spaces and punctuation), which means the 128k-token threshold gets you just about 100,000 words. For context, that's about the length of a short novel.
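To make that math concrete, here's a quick back-of-the-envelope sketch. The ratios are rough rules of thumb, not exact tokenizer behavior:

```python
# Rough token math using the ~4-characters-per-token rule of thumb.
# These ratios are approximations; real tokenization varies by model and text.
CHARS_PER_TOKEN = 4      # Google's rough guidance
WORDS_PER_TOKEN = 0.78   # ~100k tokens is roughly 80k words

def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN

# The 128k-token pricing threshold works out to roughly 100k words:
print(f"{int(128_000 * WORDS_PER_TOKEN):,} words")  # ~99,840
```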
So whether this tiered pricing matters to you depends on how long your prompts are likely to be. For simple instructions, it's unlikely to be an issue. But for prompts that involve ingesting a lot of data, passing in large JSON files, or reviewing long blocks of writing, you may hit the higher rate.
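As a rough sketch of how that plays out per call, here's an estimate for Gemini 1.5 Pro using the list prices above. We're assuming the higher rate applies to the whole request once the prompt passes 128k tokens; check Google's current rate card before relying on this:

```python
# Hypothetical cost estimator for a single Gemini 1.5 Pro call, using the
# list prices from the table above. Assumes the >128k rate applies to the
# whole request once the prompt crosses the threshold.
def gemini_pro_cost(input_tokens: int, output_tokens: int) -> float:
    long_prompt = input_tokens > 128_000
    input_rate = 2.50 if long_prompt else 1.25    # $ per million input tokens
    output_rate = 10.00 if long_prompt else 2.50  # $ per million output tokens
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

print(gemini_pro_cost(20_000, 1_000))   # short prompt: ~$0.03
print(gemini_pro_cost(200_000, 1_000))  # long prompt:  ~$0.51
```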
Gemini Flash vs Pro
Like most of the other LLM providers, Google offers different models tuned for different needs.
Flash is Google's lightweight model optimized for speed. It has a context window of 1 million tokens.
Pro is the heavyweight model optimized for performance, with a context window of 2 million tokens.
Benchmark Performance for Gemini Models vs GPT and Claude
Time for some numbers. We compared Gemini to GPT and Claude across a series of leading LLM benchmarks to understand how it measures up:
| Benchmark | Gemini 1.5 Flash | Gemini 1.5 Pro | Claude 3.5 Sonnet | Claude 3.5 Haiku | GPT-4o | GPT-4o Mini |
| --- | --- | --- | --- | --- | --- | --- |
| Undergraduate Level Knowledge (MMLU) | 78.9% (5-shot) | 89.5% (5-shot) | 86.8% (5-shot) | 85.0% (5-shot) | 86.4% (5-shot) | 84.0% (5-shot) |
| Graduate Level Reasoning (GPQA, Diamond) | 39.5% (0-shot) | 46.2% (0-shot) | 50.4% (0-shot CoT) | 48.0% (0-shot CoT) | 35.7% (0-shot CoT) | 33.0% (0-shot CoT) |
| Math Problem-Solving (MATH) | 54.9% (4-shot) | 67.7% (4-shot) | 60.1% (0-shot CoT) | 58.0% (0-shot CoT) | 52.9% (4-shot) | 50.0% (4-shot) |
| Code (HumanEval) | 74.3% (0-shot) | 84.1% (0-shot) | 84.9% (0-shot) | 80.0% (0-shot) | 67.0% (0-shot) | 65.0% (0-shot) |
| Reasoning Over Text (DROP, F1 Score) | 74.9% (variable-shot) | 78.4% (variable-shot) | 83.1% (3-shot) | 80.0% (3-shot) | 80.9% (3-shot) | 78.0% (3-shot) |
| Mixed Evaluations (BIG-Bench-Hard) | 85.5% (3-shot) | 89.2% (3-shot) | 86.8% (3-shot CoT) | 84.0% (3-shot CoT) | 83.1% (3-shot CoT) | 80.0% (3-shot CoT) |
| Common Knowledge (HellaSwag) | 86.5% (10-shot) | 93.3% (10-shot) | 95.4% (10-shot) | 93.0% (10-shot) | 95.3% (10-shot) | — |
Breaking down performance
All of these models are highly capable. The Gemini models consistently perform roughly on par with their GPT counterparts across coding, general knowledge, and math. However, Claude still outperforms the other models in most areas.
Gemini Flash stands out on cost-for-performance. Even cheaper than GPT-4o mini, it's a workhorse for small-scale applications.
So which model should I choose?
This is a complicated question. But based on our testing, a few lessons stand out:
- Lightweight models like Flash and mini are excellent for internal applications: We use mini for our fit score calculations, churn prevention, and other prompts that involve processing JSON and writing a summary output.
- Heavier models are our pick for more complex or more "human" tasks: We love smarter models like Claude, Gemini Pro, and GPT-4o for sentiment analysis and for more involved analysis tasks like PLG playbooks.
- We love the most performant models for externally-facing tasks: There's no doubt that Sonnet is an incredible model, but it comes at a hefty cost. We like to reserve it for customer-facing tasks like writing personalized outbounds. (A rough sketch of how these defaults might look in code follows below.)
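To make those defaults concrete, here's a rough sketch of how they might be encoded. The task names and model identifiers are illustrative only, not exact API strings or a prescribed policy:

```python
# Illustrative routing defaults -- task names and model IDs are examples,
# not exact API identifiers or a fixed recommendation.
MODEL_DEFAULTS = {
    "fit_score_summary": "gpt-4o-mini",           # lightweight: parse JSON, summarize
    "churn_signal_summary": "gemini-1.5-flash",   # lightweight, cheapest per token
    "sentiment_analysis": "gemini-1.5-pro",       # more "human" judgment
    "plg_playbook_analysis": "gpt-4o",            # complex, multi-step analysis
    "personalized_outbound": "claude-3-5-sonnet", # customer-facing copy
}

def pick_model(task: str) -> str:
    """Fall back to a cheap default when a task isn't explicitly mapped."""
    return MODEL_DEFAULTS.get(task, "gemini-1.5-flash")
```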