How to read LLM benchmarks

And why you shouldn’t trust them blindly

Gautham Srinivas

Published in

UX Collective

5 min readNov 24, 2024

A table showing benchmark results for different LLMs — Source: Anthropic’s Claude 3.5 Sonnet blog post

Disclaimer: The opinions stated here are my own, not necessarily those of my employer.

Every once in a while, there’s an announcement about a new model. It’s always better than its predecessor and it’s also better than all of the other frontier models in the market. The announcement comes with a table that looks like this:

The high-level message delivered here is “we are better than everyone else at almost everything”.

But how exactly is this claim made? What do these numbers mean? Can you take them at face value? Let’s break it down.

Why Benchmarks

Imagine you’re selling a car and you want to claim that it’s the best car on the market. However, potential buyers look for different features — some want the safest car, while others want the fastest. To convince the broadest audience to choose your car, you compare it against competitors using universally understood data: Safety rating, Fuel efficiency, 0-to-60 time, etc.

LLM Benchmarks serve a similar purpose. They are standardized tests and datasets designed to evaluate the performance of models across various tasks. They provide metrics and criteria to compare different models, ensuring consistency and objectivity in assessments.

How Benchmarks Work

Each Benchmark evaluates a capability that the LLM might be used for. HumanEval, for example, tests the model’s ability to write code. It consists of a set of 164 programming challenges (ex: finding a substring within a large string) and uses unit tests to check the functional correctness of generated code.

Another example is “Reasoning”, which can be defined in different ways. For the purposes of benchmarking, it’s defined as the ability to answer hard, complex questions that require step-by-step deduction and analyzing data. Here’s an example: “Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they be clearly resolved?” This question will sound hard unless you’re a physicist (or if you enjoy devouring books about quantum mechanics for some reason). It was from the GPQA benchmark, which has 448 such questions across different fields. Models receive a score based on how many questions they answer correctly.

Other tests include Language understanding (MMLU) and Math problem solving (MATH). They are similar tests with other types of questions. But within each test, it’s the same set of questions that every model is evaluated against. This is how consistency is maintained (not unlike the idea of humans taking standardized tests).

CoT & Few-shot

Zoomed in version of previous table, showing COT and few shot — Zooming into the same Claude table

Few-shot (like “3-shot”) refers to the amount of examples that were given to the model to better understand the task. 0-shot means no examples were given. CoT refers to Chain-of-Thought, where the model is asked to explain its reasoning process. CoT and examples can help improve response quality for certain tasks, which is why they are separately highlighted in benchmark results. Here’s an example I got from ChatGPT to explain CoT:

ChatGPT showing the difference between with and without COT

The problem with these Benchmarks

Lack of transparency

We don’t know how a model was trained. We don’t know how the benchmark tests were run. Then how can we say for sure that the model was not trained on the testing data? This issue is called contamination, which is a common problem in Machine Learning.

The sheer amount of data that LLMs are trained on has made it impossibly hard to detect or avoid this problem. It’s the human equivalent of finding out all the questions before the day of the exam.

Do these exams really measure ability or intelligence?

It’s likely that a significant portion of benchmark results can be explained by the ability of LLMs to memorize vast amounts of data. Then, are they really more capable than humans just because they got a better score? Here’s another example — There was news last year of ChatGPT acing the LSAT. It’s a pretty impressive feat, except when you think about (a) The fact that the LSAT usually contains questions from previous years and (b) How LSAT questions from previous years are freely available all over the internet. If ChatGPT aced the LSAT after seeing the questions before the test, would you still replace your lawyer with it?

How to choose for yourself

If you’re a developer or a team trying to use AI in your product, you need to construct your own evaluation. This evaluation needs to be focused on the use cases that matter to you. The dataset needs to be customized based on your requirements.

An example — If you want to automate customer service for your business, build a dataset of questions that your customers might ask. Then, build a system to prompt any LLM with these questions and score the answers they give you. Run this activity between the models you are considering and then make your choice.

If you’re an individual user looking to decide if you need to switch between Claude and ChatGPT, you can build a set of your most commonly used prompts (“write a cover letter”, “generate an image of…”, etc) and compare the different responses before making your decision. Even if the results aren’t statistically significant (unless you ask a lot of questions and repeat this process multiple times), it’s a controllable system that can explain your decision making. IMO, it’s much better than opaque benchmarking processes run by companies that are selling models to you.

Further reading on running evaluations can be found here. Better writing on the problems with benchmarks can be found here and here.

Disclaimer: The opinions stated here are my own, not necessarily those of my employer.

Please consider subscribing to my substack if you liked this article. Thank you!