AI Benchmark Guide
What each benchmark measures, how it works, score ranges, and which models lead. Used in the AI Frontier Model Tracker.

GPQA Diamond (Top discriminator)
PhD-level science questions designed to be Google-proof: unanswerable by web search alone. The best reasoning discriminator at the frontier.
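
A minimal sketch of how a GPQA-style multiple-choice eval is scored. The record fields, prompt template, and letter-extraction regex here are illustrative assumptions, not the official harness:

```python
import re

# Hypothetical record: real GPQA Diamond items have four answer options.
QUESTIONS = [
    {"question": "Which particle mediates the strong force?",
     "options": ["A) Photon", "B) Gluon", "C) W boson", "D) Graviton"],
     "answer": "B"},
]

def extract_choice(output: str) -> str | None:
    """Take the last standalone A-D letter the model produces."""
    letters = re.findall(r"\b([ABCD])\b", output)
    return letters[-1] if letters else None

def accuracy(model_fn, questions) -> float:
    """model_fn is any text-in/text-out callable (an API wrapper, say)."""
    correct = 0
    for q in questions:
        prompt = (q["question"] + "\n" + "\n".join(q["options"])
                  + "\nAnswer with a single letter.")
        if extract_choice(model_fn(prompt)) == q["answer"]:
            correct += 1
    return correct / len(questions)
```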

SWE-bench Verified (Production-relevant)
Real GitHub issue resolution. Gold standard for coding agent capability.
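
The evaluation loop, reduced to its shape: check out the repo at the issue's base commit, apply the model's patch, and re-run the tests the issue originally broke. A sketch only; the real harness isolates each instance in a pinned Docker environment, and the fail_to_pass test list mirrors a field in the public dataset:

```python
import subprocess

def resolved(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """True if the model's patch makes the issue's failing tests pass.

    Sketch only: the real SWE-bench harness runs each instance in an
    isolated Docker image and also re-checks previously passing tests.
    """
    applied = subprocess.run(["git", "apply", "-"], input=model_patch,
                             text=True, cwd=repo_dir)
    if applied.returncode != 0:
        return False  # an unappliable patch counts as unresolved
    tests = subprocess.run(["python", "-m", "pytest", *fail_to_pass],
                           cwd=repo_dir)
    return tests.returncode == 0
```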

MMLU-Pro (Legacy baseline)
10-choice graduate-level knowledge across STEM/law/medicine/history. Legacy but still widely referenced.
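
The move from 4 to 10 options drops the random-guess floor from 25% to 10%, and it means answer extraction has to cover letters A-J. A sketch, assuming the common "the answer is (X)" convention; real harnesses use stricter patterns:

```python
import re

CHOICES = "ABCDEFGHIJ"  # ten options, so chance performance is ~10%

def extract_answer(output: str) -> str | None:
    """Prefer an explicit 'answer is (X)' statement; fall back to the
    last standalone A-J letter. Both patterns are assumptions."""
    stated = re.findall(r"answer is \(?([A-J])\)?", output, flags=re.IGNORECASE)
    if stated:
        return stated[-1].upper()
    loose = re.findall(r"\b([A-J])\b", output)
    return loose[-1] if loose else None
```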

Humanity's Last Exam (Ceiling test)
~2500 extremely hard questions. Frontier ceiling test with very low saturation.

LiveCodeBench (Emerging standard)
Newer coding benchmark built from continuously released contest problems, which makes it resistant to training-data contamination. Less saturated than HumanEval.
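
The core idea is contamination control: every problem carries the date it appeared in a live contest, so a model can be scored only on problems released after its training cutoff. A sketch with assumed field names:

```python
from datetime import date

# Hypothetical records; the real dataset timestamps each contest problem.
PROBLEMS = [
    {"id": "lcb-001", "released": date(2024, 11, 2)},
    {"id": "lcb-002", "released": date(2025, 3, 15)},
]

def uncontaminated(problems, training_cutoff: date):
    """Keep only problems a model cannot have seen during training."""
    return [p for p in problems if p["released"] > training_cutoff]

print([p["id"] for p in uncontaminated(PROBLEMS, date(2025, 1, 1))])  # ['lcb-002']
```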

HumanEval (Near-saturated)
Python code generation from docstrings. Near-saturated: most frontier models score 90%+.
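
HumanEval's headline metric is pass@k: the probability that at least one of k sampled completions passes the unit tests. The unbiased estimator below is the one given in the Codex paper that introduced the benchmark; n is the number of samples drawn per problem, c is how many of them passed:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k = 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable running product."""
    if n - c < k:
        return 1.0  # not enough failures to fill an all-failing k-draw
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(200, 140, 1))   # 0.7  (plain per-sample pass rate)
print(pass_at_k(200, 140, 10))  # ~1.0 (ten tries almost surely hit a pass)
```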

MATH / MATH-500 (Core reasoning)
Competition-level mathematics. Reasoning models score significantly higher than non-reasoning models.
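
MATH is graded by comparing final answers, which reference solutions conventionally wrap in \boxed{...}. A sketch; real graders add LaTeX normalization and symbolic-equivalence checks (e.g. treating 1/2 and 0.5 as equal):

```python
import re

def extract_boxed(text: str) -> str | None:
    """Contents of the last \\boxed{...}. This simple regex cannot handle
    nested braces; production graders use a LaTeX-aware parser."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return found[-1].strip() if found else None

def is_correct(model_solution: str, reference_solution: str) -> bool:
    pred = extract_boxed(model_solution)
    gold = extract_boxed(reference_solution)
    return pred is not None and pred == gold
```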

AIME 2025 (Core reasoning)
American Invitational Mathematics Exam: hard competition math where every answer is an integer from 0 to 999.
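
Because the answer space is integers 0-999, grading reduces to exact integer match; the only fragile part is extracting the model's final answer. Taking the last integer in the output, as below, is a heuristic assumption rather than any harness's official rule:

```python
import re

def grade_aime(model_output: str, gold: int) -> bool:
    """Exact-match grading against AIME's 0-999 integer answer space."""
    numbers = re.findall(r"\d+", model_output)
    if not numbers:
        return False
    pred = int(numbers[-1])
    return 0 <= pred <= 999 and pred == gold
```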