MMLU, GPQA, SWE-bench: Common AI Benchmarks Explained

If you have read an AI model announcement recently, you will have met a small alphabet of acronyms (MMLU, GPQA, SWE-bench, MATH, HumanEval) presented as evidence that one model outclasses another. To anyone outside the field they look like a secret code, and it is tempting to either ignore them or take them at face value. Neither response serves you well.

This guide walks through the benchmarks you are most likely to encounter, one at a time, in plain language. For each, we explain what it tests, how to interpret a score, and whether it is likely to matter for the kind of work a business actually does. By the end you should be able to glance at a comparison table and know which numbers deserve your attention.

A quick word on what these scores mean

Before the tour, a reminder of the basics. A benchmark is a fixed set of questions with known answers, given to every model so the results can be compared. The score is almost always the percentage answered correctly, so a model that scores 80 on a benchmark got four-fifths of that particular test right. A higher number is better, but only on the narrow thing that benchmark measures. If you want the full picture of how these tests are built and where they go wrong, our companion article on how AI benchmarks work covers it in detail.
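To make the arithmetic concrete, here is a minimal sketch of how a percentage score could be worked out. The function name, the five-question test, and the answers are all invented for illustration; real benchmarks have their own grading scripts.

```python
def benchmark_score(model_answers, answer_key):
    """Return the percentage of questions answered correctly."""
    if len(model_answers) != len(answer_key):
        raise ValueError("every question needs exactly one answer")
    if not answer_key:
        raise ValueError("the benchmark has no questions")
    correct = sum(
        given == expected for given, expected in zip(model_answers, answer_key)
    )
    return 100.0 * correct / len(answer_key)


# Hypothetical five-question multiple-choice test (made-up data).
answer_key = ["B", "D", "A", "C", "B"]
model_answers = ["B", "D", "A", "C", "A"]  # the last answer is wrong

score = benchmark_score(model_answers, answer_key)
print(f"Score: {score:.0f}")  # 4 of 5 correct, so this prints "Score: 80"
```

The same fixed answer key is used for every model, which is what makes the resulting numbers comparable.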

One thing to hold onto throughout: these names sound technical, but each one is really just a question of the form “can the model do this particular kind of thing?” Once you know what each benchmark is asking, the acronyms lose their mystery and become a useful shorthand for the skills you might care about.

Each test, one skill
No single benchmark measures overall intelligence; each probes one narrow ability, which is why models are run against many.
Source: Stanford HAI AI Index

MMLU: broad general knowledge

MMLU stands for Massive Multitask Language Understanding. It is a sprawling test of factual knowledge and understanding across dozens of subjects (history, law, medicine, mathematics, and more), posed as multiple-choice questions. Its purpose is to measure how broadly a model has absorbed human knowledge rather than how deeply it can reason about any one thing.

For a business, MMLU is a reasonable proxy for general usefulness as an all-round assistant. A model that scores well tends to be a knowledgeable, capable generalist. The catch is that MMLU is one of the oldest and most widely published benchmarks, which makes it especially prone to two problems: contamination, where the answers have leaked into training data, and saturation, where the best models now cluster so close to the top that the test can no longer separate them. Read a high MMLU score as “competent generalist” rather than “clearly the best”.
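One rough way to picture saturation is to ask which models sit within a small margin of the leader and treat them as tied. The sketch below does that; the two-point margin, the model names, and the scores are all assumptions chosen for illustration, not published figures or an established rule.

```python
def effectively_tied(scores, margin=2.0):
    """Return the models whose score is within `margin` points of the best one."""
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if best - score <= margin)


# Made-up scores on a saturated test: three models cluster at the top.
mmlu_scores = {
    "model_a": 88.5,
    "model_b": 87.9,
    "model_c": 87.1,
    "model_d": 79.0,
}

print(effectively_tied(mmlu_scores))  # ['model_a', 'model_b', 'model_c']
```

When most of the shortlist lands in the tied group, the test has told you they are all competent and little else.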

GPQA: graduate-level reasoning

GPQA is short for Graduate-Level Google-Proof Q&A, and it was created partly in response to MMLU's saturation. Its questions are deliberately hard, written by experts in fields such as biology, physics, and chemistry, and designed so that even a knowledgeable non-specialist with internet access would struggle (hence “Google-proof”). The point is to test genuine reasoning, not recall that could be looked up.

Because the questions are so demanding, scores on GPQA are much lower than on MMLU, and the gaps between models are more meaningful. If your work involves complex analysis, technical problem-solving, or anything requiring the model to reason carefully through difficult material, GPQA is one of the more informative benchmarks to glance at. A model that holds up on GPQA is one you can lean on for harder thinking, not just quick lookups.

SWE-bench: fixing real software

SWE-bench is the most concrete benchmark on this list, and one of the most respected. It draws real problems from real software projects (actual bugs that real developers reported) and asks the model to produce a fix. Crucially, the fix is then run against the project's own automated tests. The model only scores if its solution genuinely makes the software work, not merely if it looks plausible.

This grounding in a verifiable, real-world outcome is what makes SWE-bench valuable. It is far harder to game than a multiple-choice test, because there is no partial credit for an answer that does not actually run. If you are evaluating AI coding assistants, SWE-bench is the benchmark to watch, though, as always, your own codebase is the real test. Our overview of AI coding assistants puts this in context.
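The all-or-nothing scoring described above can be sketched in a few lines. This is a simplified illustration, not the benchmark's actual harness: the task names, test names, and results are invented, and the real evaluation involves applying the fix and running the project's test suite.

```python
def is_resolved(test_results):
    """A task counts only if it has tests and every one of them passed."""
    return bool(test_results) and all(test_results.values())


def resolved_rate(tasks):
    """Percentage of tasks whose proposed fix made the whole test suite pass."""
    if not tasks:
        raise ValueError("no tasks to score")
    resolved = sum(is_resolved(results) for results in tasks.values())
    return 100.0 * resolved / len(tasks)


# Hypothetical outcomes after running each proposed fix against its tests.
tasks = {
    "bug_1": {"test_parse": True, "test_render": True},
    "bug_2": {"test_parse": True, "test_render": False},  # looks plausible, fails one test
    "bug_3": {"test_login": True},
    "bug_4": {"test_export": False},
}

print(resolved_rate(tasks))  # 50.0
```

Note that `bug_2` earns nothing even though half its tests pass, which is the "no partial credit" property in practice.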

Common benchmarks at a glance
Benchmark      What it measures
MMLU           Broad general knowledge across subjects
GPQA           Graduate-level expert reasoning
SWE-bench      Fixing real-world software bugs
MATH / AIME    Multi-step mathematical problem solving
HumanEval      Writing small, correct code functions

MATH and AIME: mathematical problem-solving

The MATH benchmark, and the related AIME problems drawn from a well-known mathematics competition, test whether a model can work through multi-step mathematical reasoning to reach a correct answer. These are not arithmetic drills; they require the model to plan a solution, carry it out, and arrive at a precise result that is easy to check.

Why should a non-mathematical business care? Because performance on hard mathematics is widely treated as a signal of careful, structured reasoning in general. A model that can reliably solve these problems tends to be better at any task requiring it to follow a chain of logic without losing the thread, such as planning, structured analysis, and the like. Read strong MATH or AIME scores as evidence of disciplined reasoning rather than as relevant only to mathematicians.

HumanEval: writing small pieces of code

HumanEval is an older coding benchmark that asks a model to write small, self-contained functions from a description, then checks them by running tests. It is simpler and narrower than SWE-bench (isolated puzzles rather than messy real-world projects), and like MMLU it has largely saturated, with leading models scoring very high. It remains a quick sanity check of basic coding ability, but a strong HumanEval score is no longer a meaningful differentiator the way a strong SWE-bench score is.
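As a rough picture of what "checks them by running tests" means, the sketch below runs two attempts at a small function against the same test cases. The task, the test cases, and both attempts are made up for illustration and are not taken from the benchmark itself.

```python
def passes_all_tests(candidate, test_cases):
    """Return True only if the candidate gives the expected result for every case."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failure
    return True


# Hypothetical task: "return the largest number in a non-empty list".
test_cases = [
    (([3, 1, 2],), 3),
    (([-5, -2, -9],), -2),
    (([7],), 7),
]


def correct_attempt(numbers):
    return max(numbers)


def buggy_attempt(numbers):
    return numbers[0]  # only right when the largest value happens to come first


print(passes_all_tests(correct_attempt, test_cases))  # True
print(passes_all_tests(buggy_attempt, test_cases))    # False
```

The buggy attempt gets the first case right by luck, which is why several test cases are used rather than one.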

Why new benchmarks keep appearing

You may notice that the most informative benchmarks on this list, GPQA and SWE-bench, are also among the newest, while the older ones have lost their power to separate models. This is not a coincidence. As models improve, they exhaust the difficulty of existing tests, and researchers respond by building harder ones. Expect this cycle to continue: the benchmark names in headlines a year from now may differ from today's. The underlying lesson, though, stays the same: favour tests tied to hard, verifiable tasks, and treat saturated ones as background noise.

Real-world over recall
Benchmarks tied to verifiable outcomes, like SWE-bench, are harder to game than tests that reward memorised facts.
Source: Artificial Analysis

Which benchmarks should you actually care about?

The honest answer is: the ones that match your work, and none too literally. If you want a capable all-round assistant for writing, summarising, and planning, broad tests like MMLU give a rough sense of general competence, but treat closely grouped top scores as a tie. If your work is technical or analytical, GPQA and the mathematics benchmarks are more telling. If you are choosing a coding tool, SWE-bench is the one to weigh most heavily.

Whatever your work, resist two temptations. The first is to fixate on a single headline number; a model strong across several benchmarks is a safer bet than one that tops a single famous test. The second is to mistake any benchmark for your own reality. Public leaderboards such as Artificial Analysis aggregate these tests helpfully, and crowd-voted comparisons like LMArena add a human-preference dimension, but the decisive test is always running your own tasks through the shortlisted tools.

A simple way to keep all of this in proportion is to remember what the scores are for. They exist to help researchers and buyers compare models at a glance: a starting point, not a substitute for judgement. The moment a benchmark becomes the goal rather than a guide, it starts to mislead. Use these numbers to narrow your options quickly, then trust your own hands-on trial to make the final call. For a structured way to do that comparison, see our guide to evaluating AI tools, and for the bigger picture, our pillar article on what artificial intelligence is.

How these benchmarks fit together

It can be tempting to treat these tests as rivals, but they are better understood as complementary lenses, each illuminating a different facet of what a model can do. Broad knowledge tests like MMLU tell you whether a model is a capable generalist. Hard reasoning tests like GPQA tell you whether it can think carefully through difficult material. Mathematics tests reveal its discipline in following a long chain of logic. Coding benchmarks, especially the verified ones, show whether it can produce something that genuinely works. No single test captures all of this, which is exactly why announcements present a whole table rather than one number.

For a business, the practical takeaway is to read the table selectively rather than trying to absorb every figure. Identify the one or two skills that matter for the work you have in mind, find the benchmarks that measure those skills, and let the rest serve as background. A model that leads on a skill irrelevant to you is no more useful than a car that is fastest on a track you will never drive. Matching the measurement to your need is the whole art of reading these comparisons well.

A word on how quickly this changes

One final caution: anything specific you read about benchmark scores ages quickly. New models arrive constantly, older ones are updated, and the tests themselves are revised or replaced as they saturate. Treat the particular numbers you see today as a snapshot rather than a settled ranking. What stays constant is the way of thinking laid out here: understanding what each test measures, favouring verified over recalled performance, and trusting your own hands-on trial above any published figure. Hold onto the method, and the shifting numbers will trouble you far less.

Translating scores into a confident decision

Suppose you have read a comparison table and one or two models stand out on the skills you care about. What next? The mistake is to stop there and simply adopt the top scorer. A wiser path is to treat the table as having narrowed a crowded field down to a short, sensible list. From that list, the deciding factors are usually practical rather than numerical: how the tool fits the way you already work, how quickly it responds, how clearly it explains itself, and how comfortable your team feels using it. These qualities never appear on a benchmark, yet they often matter more day to day than a few points of measured skill.

So give each shortlisted model the same small set of real tasks from your own work and compare the results with your own eyes. Because those tasks are specific to you and were never published, no model could have memorised them, which neatly sidesteps the contamination problem that quietly inflates so many public scores. Whichever tool produces the most useful results on your own material, with the least correction, is the right answer for you, regardless of where it happened to sit on the leaderboard. The benchmarks pointed you to the shortlist; your own judgement makes the final call.
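If you want to keep a tidy record of such a trial, a small tally like the one below is enough. It is only a sketch under stated assumptions: the tool names, tasks, and figures are invented, and ranking by usable outputs first and corrections second is one reasonable reading of "most useful results with the least correction", not a standard method.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    model: str
    task: str
    usable: bool      # was the output good enough to use?
    corrections: int  # how many edits it needed first


def rank_models(results):
    """Rank by usable outputs (more is better), then total corrections (fewer is better)."""
    totals = {}
    for r in results:
        usable, corrections = totals.get(r.model, (0, 0))
        totals[r.model] = (usable + int(r.usable), corrections + r.corrections)
    return sorted(totals, key=lambda m: (-totals[m][0], totals[m][1]))


# Made-up notes from running the same two tasks through three shortlisted tools.
results = [
    TrialResult("tool_a", "summarise contract", True, 1),
    TrialResult("tool_a", "draft customer reply", True, 3),
    TrialResult("tool_b", "summarise contract", True, 0),
    TrialResult("tool_b", "draft customer reply", True, 1),
    TrialResult("tool_c", "summarise contract", False, 5),
    TrialResult("tool_c", "draft customer reply", True, 2),
]

print(rank_models(results))  # ['tool_b', 'tool_a', 'tool_c']
```

The judgement of what counts as usable still comes from you; the tally only keeps the comparison honest across tools.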

Frequently asked questions

What does MMLU actually test?
MMLU measures broad general knowledge across dozens of subjects using multiple-choice questions. It is a reasonable proxy for how capable a model is as an all-round assistant, though it has largely saturated at the top.

Why is SWE-bench considered more trustworthy?
Because it uses real software bugs and checks each fix against the project's own automated tests. The model only scores if its solution genuinely works, which is far harder to fake than answering multiple-choice questions.

Do I need to understand all these benchmarks?
No. Focus on the one or two that match your work: broad tests for a general assistant, SWE-bench for coding tools, GPQA and mathematics tests for technical reasoning. The rest is useful context, not essential.

Why do math benchmarks matter for non-math work?
Strong performance on hard mathematics signals careful, multi-step reasoning, which tends to carry over to planning and structured analysis. It is read as a sign of disciplined thinking, not just mathematical skill.

References

  1. Stanford HAI, AI Index Report, hai.stanford.edu
  2. Artificial Analysis, independent AI benchmarking, artificialanalysis.ai

