Understanding LLM Benchmarks

Ellen Perfect
5 February 2025

The great LLM performance race is on, and everyone wants to know which models are smarter, faster, and more performant. But breaking down benchmark leaderboards is no small task. Take this open-source leaderboard, for example:

[Image: open-source LLM benchmark leaderboard]

The information is valuable but dense: the benchmarks and scoring methodologies all differ, and each measures a slightly different aspect of how a model performs.

So let's break down how to parse these benchmarks, starting with how the tests are conducted.

Prompting Techniques

Many LLM benchmark comparisons will indicate the style in which the test was conducted. This helps make fair comparisons between models that were tested under the same circumstances, but it also tells us a bit about which kinds of interactions the model is best suited for.

Few-shot: In this style of testing, the model is passed a number of examples along with the prompt to show how similar problems could be solved. This doesn't actually train the model; instead, the examples act as context that helps the model shape its response. 1-shot prompting means one example was given, 2-shot means two examples, and so on. 

Zero-shot: In this style of prompting, the model is given no examples or context beyond the prompt itself and what it learned during development; it has to intuit the best response and format on its own. (Chain-of-thought, which also shows up on leaderboards, is a separate technique in which the model is prompted to reason step by step before answering.)
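To make the difference concrete, here's a minimal sketch in Python of how a test harness might assemble a 0-shot versus a 2-shot prompt for the same question. The example items and the build_prompt helper are made up for illustration, not taken from any particular benchmark.

```python
# Minimal sketch: assembling zero-shot vs. few-shot prompts.
# The example items and helper names are hypothetical.

EXAMPLES = [
    {"question": "What is 7 + 5?", "answer": "12"},
    {"question": "What is 9 - 4?", "answer": "5"},
]

def build_prompt(question: str, n_shots: int = 0) -> str:
    """Prepend n_shots worked examples as context; 0 shots means the bare question."""
    shots = EXAMPLES[:n_shots]
    parts = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_prompt("What is 6 + 8?", n_shots=0))  # zero-shot: just the question
print(build_prompt("What is 6 + 8?", n_shots=2))  # 2-shot: two solved examples first
```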

Understanding these will help you to:

  • Ensure that you're making apples-to-apples comparisons across test results: A 0-shot result can be very different from a 5-shot result.
  • Optimize your model selection for your prompting style: If you plan to give a model examples or enforce a structured response format, a model that performs best on 3- or 5-shot tests might be the right fit. But for situations where the prompt will be given with no other context, it's important to select a model that performs well on 0-shot tests. 

 

General Benchmarks

MMLU

MMLU (Massive Multitask Language Understanding) subjects models to multiple-choice questions across 57 subjects spanning the humanities, STEM, philosophy, law, and medicine. It includes over 16,000 questions and is one of the most commonly used benchmark tests. 

The developers of the MMLU estimate that human domain experts achieve scores of around 89.8% accuracy, which the most advanced models are now beating. 

An independent review by experts in these fields estimated that about 9% of the MMLU's questions are incorrectly worded or unclear, which is why roughly 90% is often treated as the effective maximum attainable score.
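For a sense of how a multiple-choice benchmark like the MMLU gets scored, here's a rough sketch: format the question with lettered options, ask the model for a letter, and compute accuracy against the gold answers. The sample item and the fake_model stub are placeholders, not real MMLU data or the official evaluation harness.

```python
# Rough sketch of multiple-choice scoring in the style of MMLU.
# The sample item and fake_model below are placeholders, not real MMLU data.

LETTERS = "ABCD"

def format_item(question: str, choices: list[str]) -> str:
    lines = [question] + [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def fake_model(prompt: str) -> str:
    # Stand-in for a real model call; always answers "B" here.
    return "B"

items = [
    {
        "question": "Which gas makes up most of Earth's atmosphere?",
        "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
        "answer": "B",
    },
]

correct = sum(
    fake_model(format_item(it["question"], it["choices"])).strip().upper() == it["answer"]
    for it in items
)
print(f"accuracy = {correct / len(items):.1%}")
```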

Leading LLMs According to the MMLU

  • Claude 3.5 Sonnet (88.7)
  • GPT-4o (88.7)

Reasoning Benchmarks

BIG-Bench Hard

The BIG in BIG-Bench stands for Beyond the Imitation Game. It is an open-source benchmark developed by researchers at over 100 institutions, with tasks ranging from chess-based problems to identifying emojis. 

The Hard iteration consists of 23 challenging tasks from the original BIG-Bench on which previous models failed to outperform a human. 

It's important to note that this test was designed with predicting the future capabilities of LLMs in mind, not necessarily ranking the performance of existing models.
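Reasoning benchmarks in this family are commonly run with chain-of-thought prompts and then scored by pulling a final answer out of the model's reasoning and checking it against the reference. Here's a simplified sketch; the "the answer is ..." extraction convention and the sample response are assumptions for illustration, not the official BIG-Bench Hard harness.

```python
import re

# Simplified sketch: extract a final answer from a chain-of-thought response
# and score it by exact match. The "the answer is ..." convention is an
# assumption for illustration, not the official BIG-Bench Hard harness.

def extract_answer(response: str) -> str | None:
    match = re.search(r"the answer is\s*(.+?)\.?\s*$", response.strip(), re.IGNORECASE)
    return match.group(1).strip() if match else None

response = (
    "The word 'banana' has three a's and the word 'bread' has one. "
    "So the answer is (B)."
)
gold = "(B)"
print(extract_answer(response) == gold)  # True
```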

 

DROP

DROP stands for Discrete Reasoning Over Paragraphs. It involves over 9,500 challenges requiring numerical manipulation, multi-step reasoning, and the interpretation of text-based data. 

The DROP test has been known to produce inconsistent scores depending on the answer format a model defaults to. For this reason, as of 2024 it is being reworked to better assess capabilities across a broad range of models. 
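Part of why format matters: DROP-style scoring typically combines exact match with a token-level F1, so a model that answers "12 eggs" when the reference is just "12" loses credit even though the number is right. Here's a simplified token F1 in that spirit; the official metric adds number and article normalization that's omitted here.

```python
# Simplified token-level F1 in the spirit of DROP scoring; the official
# metric adds number and article normalization that is omitted here.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("12", "12"))        # 1.0 -- formats agree
print(token_f1("12 eggs", "12"))   # ~0.67 -- same answer, different format, lower score
```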

HellaSwag

HellaSwag is designed to test common-sense reasoning with 10,000 sentence-completion tasks. It is good at revealing where a model struggles with complex context cues, which makes it a useful indicator of conversational performance. 
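Under the hood, each HellaSwag item pairs a context with four candidate endings, and the model gets credit when it ranks the correct ending as the most plausible, typically by comparing likelihoods. Here's a sketch with a stand-in scoring function; the sample item and the fake scores are invented, and a real harness would compute log-probabilities from the model.

```python
# Sketch of HellaSwag-style scoring: pick the candidate ending the model
# finds most plausible. ending_log_likelihood is a stand-in; a real harness
# would sum token log-probabilities from the model.

def ending_log_likelihood(context: str, ending: str) -> float:
    # Placeholder scores so the example runs; these would come from the model.
    fake_scores = {
        "flips the pancake when bubbles appear.": -4.1,
        "throws the pan out of the window.": -9.8,
        "recites the alphabet backwards to the pan.": -12.3,
        "paints the pancake blue before serving it to his cat.": -11.0,
    }
    return fake_scores[ending]

item = {
    "context": "A man pours pancake batter into a hot pan. He",
    "endings": [
        "flips the pancake when bubbles appear.",
        "throws the pan out of the window.",
        "recites the alphabet backwards to the pan.",
        "paints the pancake blue before serving it to his cat.",
    ],
    "label": 0,
}

scores = [ending_log_likelihood(item["context"], e) for e in item["endings"]]
predicted = scores.index(max(scores))
print(predicted == item["label"])  # True when the correct ending scores highest
```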

Leading LLMs According to HellaSwag

  • Compass MTL: 96.1
  • DeBERTA Large: 95.6
  • GPT-4: 95.3

Math Benchmarks

GSM 8k

GSM 8k (Grade School Math, roughly 8,500 problems) consists of word problems testing grade-school math skills. The problems typically involve between 2 and 8 steps and are designed to be solvable by a grade-school student, which makes the test a good indicator of reading comprehension and logical problem structuring.
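Scoring GSM 8k usually comes down to extracting the final number from the model's worked solution and comparing it to the reference answer. Here's a simplified sketch; the sample problem, model output, and extraction convention are made up for illustration.

```python
import re

# Simplified sketch of GSM 8k-style scoring: pull the last number out of the
# model's worked solution and compare it to the reference. The model output
# below is made up for illustration.

def final_number(text: str) -> float | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None

model_output = (
    "Each box holds 6 eggs, so 4 boxes hold 4 * 6 = 24 eggs. "
    "After selling 9, there are 24 - 9 = 15 eggs left. The answer is 15."
)
reference_answer = "15"
print(final_number(model_output) == float(reference_answer))  # True
```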

Leading LLMs According to GSM 8k

  • Mistral-7B: 96.4
  • Claude 3 Opus: 95
  • GPT-4: 94.8

MATH

No acronyms this time: MATH stands straightforwardly for math. The test involves 12,500 problems covering geometry, algebra, probability, and calculus. Even highly educated math students can score as low as 40%, and a score of 90% is an outlier. 
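MATH reference answers are conventionally wrapped in a LaTeX \boxed{...} expression, so scoring scripts extract that expression from the model's solution and compare it to the reference. Here's a simplified sketch that skips the normalization and equivalence checking a full grader would need.

```python
# Simplified sketch of MATH-style answer extraction: find the contents of the
# last \boxed{...} in a solution. A full grader would also normalize and check
# mathematical equivalence (e.g. 1/2 vs 0.5), which is omitted here.

def extract_boxed(solution: str) -> str | None:
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    depth, i = 0, start + len(r"\boxed{")
    answer = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return "".join(answer)
            depth -= 1
        answer.append(ch)
        i += 1
    return None

solution = r"The probability is \frac{3}{8}, so the answer is \boxed{\frac{3}{8}}."
print(extract_boxed(solution) == r"\frac{3}{8}")  # True
```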

Leading LLMs According to MATH

  • Gemini 2.0: 89.7%
  • GPT-4o: 76.6%
  • Llama 3.1: 73.8%

Code Benchmarks

HumanEval

Developed by OpenAI, HumanEval is a dataset used to evaluate the coding capabilities of an LLM. It consists of 164 programming challenges, each pairing a function signature and docstring with a set of unit tests. The problems assess capabilities related to language comprehension, algorithms, and simple mathematics, and they are designed to mimic a junior software engineering interview.
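In practice, a completion counts as a pass only if it runs clean against the problem's unit tests. Here's a toy version of that check along with a pass@1 calculation; the sample task is invented, and a real harness would sandbox the execution rather than calling exec directly.

```python
# Toy version of HumanEval-style checking: run the model's completion against
# unit tests and count it as a pass only if they all succeed. The sample task
# is invented, and a real harness would sandbox this exec call.

task = {
    "prompt": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n",
    "completion": "    return a + b\n",
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}

def passes(task: dict) -> bool:
    namespace: dict = {}
    try:
        exec(task["prompt"] + task["completion"] + task["test"], namespace)
        return True
    except Exception:
        return False

print(f"pass@1 = {1.0 if passes(task) else 0.0}")
```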

Leading LLMs According to HumanEval

  • Claude 3.5 Sonnet: 92.0
  • GPT-4o: 90.2

 

Putting Theory to Practice

Ready to start thinking about use cases? Check out our guide to comparing Claude and GPT across common AI columns use cases.

Want to see some of these models in action? Check out this discussion of using Claude to automate PLG playbooks in Salesforce.