
MATH / MATH-500

MATH is a competition-level mathematics benchmark with problems spanning algebra, geometry, number theory, and calculus. MATH-500 is a commonly used 500-problem subset. The large gap between reasoning and general models on this benchmark makes it valuable for evaluating chain-of-thought capability.

Key facts

Category: Math
Scoring: Percentage correct (0-100%)
Frontier range: 50-97%
Status (2026): Core reasoning

How MATH / MATH-500 works

Problems are drawn from AMC (American Mathematics Competitions), AIME, and other mathematics competitions. Each requires multi-step reasoning and precise calculation. MATH-500 uses a curated 500-problem subset of the original 12,500-problem dataset. Models are evaluated on whether they produce the correct final answer.
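
To illustrate the grading step, here is a minimal sketch of a common exact-match approach: pull the final answer out of the model's solution (MATH reference solutions wrap it in \boxed{...}), lightly normalize it, and compare it with the reference answer. The helper names (extract_boxed, normalize, grade) are illustrative, not from any particular harness, and production graders typically add more robust equivalence checks (for example, symbolic comparison of expressions).

```python
# Illustrative sketch of MATH-style final-answer grading (not an official grader).
import re


def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution string."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)


def normalize(ans: str) -> str:
    """Rough normalization: strip whitespace, \\left/\\right, and trailing periods."""
    ans = ans.strip().rstrip(".")
    ans = ans.replace(r"\left", "").replace(r"\right", "")
    return re.sub(r"\s+", "", ans)


def grade(model_output: str, reference_solution: str) -> bool:
    """Exact-match check on the final answer only; reasoning steps are not scored."""
    pred = extract_boxed(model_output)
    gold = extract_boxed(reference_solution)
    if pred is None or gold is None:
        return False
    return normalize(pred) == normalize(gold)


# The reported score is the percentage of problems graded correct, e.g.:
# results = [grade(output, ref) for output, ref in zip(model_outputs, references)]
# score = 100 * sum(results) / len(results)
```

The resulting accuracy, expressed as a percentage of the 500 problems answered correctly, is the 0-100% score shown in the key facts above.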

What is a good MATH / MATH-500 score?

Reasoning models score 88-97%. General models score 50-90%. The wide spread between reasoning and general models is the benchmark's key feature. A score above 95% indicates exceptional mathematical reasoning - only a handful of models achieve this.

Why MATH / MATH-500 matters

MATH reveals the difference between models that can reason through multi-step problems and those that pattern-match. The gap between reasoning models (88-97%) and general models (50-90%) is larger on MATH than on most other benchmarks. This makes it uniquely useful for evaluating whether a model's thinking or chain-of-thought mode provides genuine benefit.

How does MATH / MATH-500 compare to other benchmarks?

MATH tests multi-step mathematical reasoning, while AIME tests harder competition-level problems requiring creative insight. MATH has a wider range of difficulty and is more widely reported. Compared to MMLU-Pro (which includes some math), MATH is more focused and harder. Compared to GPQA Diamond (science reasoning), MATH tests formal mathematical reasoning specifically.

Which AI model has the highest MATH / MATH-500 score?

Top 10 models by MATH / MATH-500

Frequently asked questions

What is MATH / MATH-500?
MATH is a competition-level mathematics benchmark with problems from AMC, AIME, and other competitions. MATH-500 is a commonly used 500-problem subset. It tests multi-step reasoning across algebra, geometry, number theory, and calculus.

What is a good MATH / MATH-500 score?
Reasoning models score 88-97%. General models score 50-90%. The large gap between these categories makes MATH useful for evaluating chain-of-thought and thinking mode effectiveness.

How does MATH / MATH-500 differ from AIME?
MATH draws from multiple competition sources and spans a wide difficulty range. AIME uses only American Invitational Mathematics Exam problems, which are harder and require more creative problem-solving. Both test mathematical reasoning, but AIME is more selective.

Which AI model has the highest MATH / MATH-500 score?
As of April 2026, DeepSeek R1 and Kimi K2 lead at 97.3-97.4%, with o3 close behind at 97.3% and MiniMax M1 at 96.8%. Reasoning models dominate this benchmark.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
