MATH / MATH-500
MATH is a benchmark of competition-level mathematics problems spanning prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. MATH-500 is a commonly used 500-problem subset. The large gap between reasoning and general models on this benchmark makes it valuable for evaluating chain-of-thought capability.
Key facts
How MATH / MATH-500 works
Problems are drawn from the AMC (American Mathematics Competitions), the AIME (American Invitational Mathematics Examination), and other mathematics competitions. Each requires multi-step reasoning and precise calculation. MATH-500 is a fixed 500-problem subset of the original 12,500-problem dataset. A model is scored only on whether its final answer is correct.
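Final-answer scoring can be sketched as follows. This is a minimal illustration, not the grader used by any particular leaderboard: it assumes the model wraps its final answer in \boxed{...}, as the MATH reference solutions do, and it compares answers as lightly normalized strings. The function names (extract_boxed, normalize, is_correct, accuracy) are hypothetical, and a real harness would usually add a mathematical equivalence check so that, for example, 0.5 and \frac{1}{2} count as the same answer.

```python
import re
from typing import List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    # Walk forward from the opening brace, tracking nested braces.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Light string normalization (an assumption, not an official rule set)."""
    answer = answer.strip().strip("$")
    answer = answer.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    return re.sub(r"\s+", "", answer)


def is_correct(model_output: str, reference: str) -> bool:
    """True when the extracted final answer matches the reference string."""
    predicted = extract_boxed(model_output)
    if predicted is None:
        return False
    return normalize(predicted) == normalize(reference)


def accuracy(outputs: List[str], references: List[str]) -> float:
    """Fraction of problems whose final answer matches the reference."""
    if len(outputs) != len(references) or not outputs:
        raise ValueError("outputs and references must be non-empty and equal length")
    hits = sum(is_correct(o, r) for o, r in zip(outputs, references))
    return hits / len(outputs)
```

Because only the extracted answer is compared, a response with flawless working but a malformed or missing \boxed{...} is counted as wrong under this sketch.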
What is a good MATH / MATH-500 score?
Reasoning models score 88-97%, while general models score 50-90%. The two ranges overlap at the top, but general models spread much lower, and that spread is the benchmark's key feature. A score above 95% indicates exceptional mathematical reasoning; only a handful of models achieve it.
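When reading scores this close to the ceiling, it helps to know how much a single problem moves the result. The sketch below works through that arithmetic for a 500-problem set. The standard-error figure assumes each problem is an independent pass/fail trial, which is a simplification, and the function names are hypothetical.

```python
import math

N_PROBLEMS = 500  # size of the MATH-500 subset


def points_per_problem(n: int = N_PROBLEMS) -> float:
    """Percentage points gained or lost by one additional correct answer."""
    return 100.0 / n


def standard_error_points(score_percent: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a score, in percentage points.

    Assumes independent pass/fail trials with a common success rate,
    which real benchmark problems only approximate.
    """
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)
```

Under these assumptions, one problem is worth 0.2 percentage points and a 95% score carries a standard error of roughly one percentage point, so differences of a point or less between top models are hard to distinguish from noise.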
Why MATH / MATH-500 matters
MATH separates models that can reason through multi-step problems from those that rely on pattern matching. The gap between reasoning models (88-97%) and general models (50-90%) is larger on MATH than on most other benchmarks, which makes it particularly useful for checking whether a model's thinking or chain-of-thought mode provides a genuine benefit.
How does MATH / MATH-500 compare to other benchmarks?
MATH tests multi-step mathematical reasoning across a wide range of difficulty, while AIME, used as a standalone benchmark, covers only harder competition problems that require creative insight. MATH is also more widely reported. Compared to MMLU-Pro, which includes some math among other subjects, MATH is more focused and harder. Compared to GPQA Diamond, which tests science reasoning, MATH targets formal mathematical reasoning specifically.
Which AI model has the highest MATH / MATH-500 score?
Top 10 models by MATH / MATH-500
Frequently asked questions