
GPQA Diamond

GPQA Diamond is a graduate-level science benchmark that tests AI models on physics, biology, and chemistry questions specifically designed so answers cannot be found through web search. It is currently the most trusted reasoning benchmark for comparing frontier AI models.

Key facts

Category: Reasoning
Scoring: Percentage correct (0-100%)
Frontier range: 80-94%
Status (2026): Top discriminator

How GPQA Diamond works

GPQA Diamond uses multiple-choice questions reviewed and validated by PhD-level domain experts. Each question is tested to ensure it cannot be answered by searching the internet, requiring the model to reason from genuine understanding. The Diamond subset contains the hardest questions where even human domain experts frequently disagree on the correct answer.
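As an informal sketch of how percentage-correct scoring works on a multiple-choice set like this, the snippet below tallies correct answers over a list of questions. The Question format and ask_model function are hypothetical placeholders, not the official evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    choices: list[str]   # e.g. ["(A) ...", "(B) ...", "(C) ...", "(D) ..."]
    answer: str          # label of the correct choice, e.g. "C"

def ask_model(question: Question) -> str:
    """Placeholder: call the model under test and return its chosen label."""
    raise NotImplementedError

def gpqa_score(questions: list[Question]) -> float:
    """Percentage of questions answered correctly, on the 0-100% scale."""
    correct = sum(ask_model(q) == q.answer for q in questions)
    return 100.0 * correct / len(questions)
```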

What is a good GPQA Diamond score?

Frontier models score between 80% and 94% on GPQA Diamond. Scores below 70% indicate significant reasoning gaps. A score above 90% places a model in the top tier of scientific reasoning capability. The benchmark has enough headroom that scores will remain meaningful as models continue to improve.
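As a rough illustration of how these thresholds might be applied, the snippet below maps a score to the bands described above; the tier labels are informal shorthand, not an official classification.

```python
def interpret_gpqa(score: float) -> str:
    """Map a GPQA Diamond percentage to the rough tiers described above."""
    if score >= 90:
        return "top tier of scientific reasoning"
    if score >= 80:
        return "within the frontier range (80-94%)"
    if score >= 70:
        return "below the frontier range"
    return "significant reasoning gaps"

print(interpret_gpqa(91.3))  # -> top tier of scientific reasoning
```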

Why GPQA Diamond matters

GPQA Diamond is the best available measure of deep scientific reasoning in AI models. Unlike MMLU-Pro, which is near-saturated at the frontier, GPQA Diamond produces meaningful 10-15 point spreads between top models. It is hard to game through memorization, hard to inflate through prompt engineering, and provides the most honest signal of reasoning capability improvement. When a lab claims a reasoning breakthrough, GPQA Diamond is typically the benchmark they cite.

How does GPQA Diamond compare to other benchmarks?

GPQA Diamond is harder than MMLU-Pro and tests deeper reasoning rather than broad knowledge recall. While MMLU-Pro has frontier models clustered within about 7 points (83-90%), GPQA Diamond spreads them across roughly 14 points (80-94%), making it far more useful for distinguishing between top models. Compared to HLE (Humanity's Last Exam), GPQA Diamond focuses specifically on science, while HLE spans all domains. GPQA Diamond is also more widely reported than HLE, with most frontier model launches including a GPQA score.

Which AI model has the highest GPQA Diamond score?

As of April 2026, Gemini 3.1 Pro leads with 94.3%, followed by GPT-5.4 at 92.0% and Claude Opus 4.6 at 91.3%.

Top 10 models by GPQA Diamond (full leaderboard in the AI Frontier Model Tracker)

Frequently asked questions

What is GPQA Diamond?
GPQA Diamond is a graduate-level science benchmark with questions in physics, biology, and chemistry that are designed to be impossible to answer through web search. It tests genuine scientific reasoning and is the most trusted reasoning benchmark for frontier AI models in 2026.

What is a good GPQA Diamond score?
Frontier models score between 80% and 94%. A score above 90% indicates top-tier scientific reasoning. Below 70% suggests significant reasoning limitations. The current leader is Gemini 3.1 Pro at 94.3%.

How does GPQA Diamond compare to MMLU-Pro?
GPQA Diamond tests deeper scientific reasoning with questions that cannot be Googled, while MMLU-Pro tests broader knowledge recall across 14 subject categories. GPQA Diamond has a wider score spread at the frontier (80-94%) compared to MMLU-Pro (83-90%), making it better for distinguishing top models.

Which AI model has the highest GPQA Diamond score?
As of April 2026, Gemini 3.1 Pro leads with 94.3%, followed by GPT-5.4 at 92.0% and Claude Opus 4.6 at 91.3%. Scores are from the DemandSphere AI Frontier Model Tracker.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
