
AI Benchmark Guide

What each benchmark measures, how it works, score ranges, and which models lead. Used in the AI Frontier Model Tracker.

Reasoning

GPQA Diamond

PhD-level science questions designed to be un-Googleable. Best reasoning discriminator at the frontier.

Top discriminator
Coding

SWE-bench Verified

Real GitHub issue resolution. Gold standard for coding agent capability.

Production-relevant
Knowledge

MMLU-Pro

10-choice graduate-level knowledge across STEM/law/medicine/history. Legacy but still widely referenced.

Legacy baseline
Reasoning

Humanity's Last Exam

~2500 extremely hard questions. Frontier ceiling test with very low saturation.

Ceiling test
Coding

LiveCodeBench

Newer coding benchmark built from continuously collected contest problems to limit training-data contamination. Less saturated than HumanEval.

Emerging standard
Coding

HumanEval

Python code generation from docstrings. Near-saturated; most frontier models score 90%+.

Near-saturated
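Coding-benchmark scores like HumanEval's are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the standard unbiased estimator (n samples drawn, c of them correct), assuming pass@k is the metric behind the percentages cited here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k draws
    (without replacement) from n samples is among the c correct ones."""
    if n - c < k:
        # Fewer incorrect samples than the budget: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 50 passing: pass@1 is 50/200 = 0.25
print(pass_at_k(200, 50, 1))
```

A model's headline number (e.g. "90%+ on HumanEval") is this estimate averaged over all problems, usually at k=1.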
Math

MATH / MATH-500

Competition-level mathematics. Reasoning models score significantly higher.

Core reasoning
Math

AIME 2025

American Invitational Mathematics Exam. Hard competition math.

Core reasoning