
MMLU-Pro

MMLU-Pro is a 10-choice, graduate-level knowledge benchmark spanning 14 subject areas, including STEM, law, medicine, and history, consolidated from the original MMLU's 57 subjects. It is among the most widely recognized AI benchmark names, though frontier models now cluster between 83% and 90%, which limits its ability to distinguish top models.

Key facts

Category: Knowledge
Scoring: Percentage correct (0-100%)
Frontier range: 83-90%
Status (2026): Legacy baseline

How MMLU-Pro works

MMLU-Pro presents graduate-level multiple-choice questions across 14 subject areas. The Pro variant uses 10 answer choices instead of the original MMLU's 4, cutting the random-guess baseline from 25% to 10% and requiring stronger reasoning to eliminate incorrect options. Questions are sourced from professional exams, graduate coursework, and academic assessments.
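The scoring described above is plain exact-match accuracy. A minimal sketch, assuming a simple list-of-letters record format (the field layout here is illustrative, not the official MMLU-Pro harness):

```python
import random

def accuracy(predictions, answers):
    """Percentage of predicted answer letters that exactly match the gold letters."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Widening from 4 to 10 options lowers the random-guess baseline
# from ~25% to ~10%, which is the point of the Pro variant.
choices_pro = "ABCDEFGHIJ"   # 10 options -> ~10% guess baseline
choices_mmlu = "ABCD"        # 4 options  -> ~25% guess baseline

gold = [random.choice(choices_pro) for _ in range(10_000)]
guesses = [random.choice(choices_pro) for _ in range(10_000)]
print(round(accuracy(guesses, gold), 1))  # hovers near 10.0
```

Real harnesses differ mainly in how they extract the chosen letter from the model's raw output, which is one source of cross-provider score discrepancies.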

What is a good MMLU-Pro score?

Frontier models score between 83% and 90% on MMLU-Pro. Scores above 85% are considered strong. Below 70% indicates a mid-tier or smaller model. The narrow 7-point spread at the top means small score differences (1-2 points) are often not meaningful and may reflect evaluation methodology rather than genuine capability differences.
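The claim that 1-2 point gaps are often not meaningful can be checked with a back-of-the-envelope error bar. This sketch assumes roughly 12,000 questions (MMLU-Pro has about that many) and treats each question as an independent Bernoulli trial, which is a simplification:

```python
from math import sqrt

def score_stderr(accuracy, n_questions):
    """Binomial standard error of a benchmark score, in percentage points."""
    p = accuracy / 100.0
    return 100.0 * sqrt(p * (1.0 - p) / n_questions)

se = score_stderr(87.0, 12_000)   # ~0.31 points
ci_95 = 1.96 * se                 # ~0.6 points either way
print(round(se, 2), round(ci_95, 2))
```

Sampling error alone puts roughly ±0.6 points around a single frontier score, and methodology differences (prompt format, answer extraction) typically add more, so 1-2 point gaps between top models sit near the noise floor.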

Why MMLU-Pro matters

MMLU-Pro remains the most widely reported benchmark in AI model announcements. Despite near-saturation at the frontier, it is useful as a baseline measure of general knowledge and for comparing across model tiers (small vs large, general vs reasoning). Every major model launch includes an MMLU-Pro score, making it the common denominator for cross-provider comparison.

How does MMLU-Pro compare to other benchmarks?

MMLU-Pro tests broad knowledge recall across 14 subject areas, while GPQA Diamond tests deep reasoning in science specifically. MMLU-Pro is easier and more saturated (top models cluster within 7 points, at 83-90%), while GPQA Diamond spreads them across 15 points. For frontier model comparison, GPQA Diamond is more informative. MMLU-Pro remains useful as a baseline and for comparing smaller models.

Which AI model has the highest MMLU-Pro score?

Top 10 models by MMLU-Pro

Frequently asked questions

What is MMLU-Pro?
MMLU-Pro is a 10-choice, graduate-level knowledge benchmark covering 14 subject areas, including STEM, law, medicine, and history. It is one of the most widely referenced AI benchmarks, cited in virtually every frontier model announcement since its 2024 release.

What is a good MMLU-Pro score?
Frontier models score 83-90%. Above 85% is considered strong. The benchmark is near-saturated at the frontier, meaning top models cluster closely and small differences may not be meaningful.

Is MMLU-Pro still relevant?
MMLU-Pro remains widely reported but is near-saturated at the frontier. GPQA Diamond and HLE are better for distinguishing between top models. MMLU-Pro is still useful as a baseline and for comparing models across different size tiers.

Which AI model has the highest MMLU-Pro score?
As of April 2026, Claude Opus 4.5 leads at 89.5%, followed by Claude Opus 4.6 at 89.0% and GPT-5 at 88.0%. Scores from the DemandSphere AI Frontier Model Tracker.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
