GPQA Diamond
GPQA Diamond is a graduate-level science benchmark that tests AI models on physics, biology, and chemistry questions. GPQA stands for Graduate-Level Google-Proof Q&A: the questions are written so that answers cannot simply be found through web search. It is one of the most widely cited reasoning benchmarks for comparing frontier AI models.
Key facts
How GPQA Diamond works
GPQA Diamond uses multiple-choice questions written and validated by domain experts who hold or are pursuing PhDs. Each question is also given to skilled non-experts with unrestricted web access, to check that it cannot be answered by searching the internet and instead requires genuine understanding. The Diamond subset is the highest-quality slice of the full GPQA set: 198 questions that both expert validators answered correctly and that most non-expert validators got wrong.
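As a rough illustration of how a multiple-choice benchmark like this is scored, the sketch below computes accuracy as the fraction of questions where the model's chosen option matches the answer key. The `Question` class, the function names, and the placeholder data are hypothetical and are not part of any official GPQA tooling. The four-option format is an assumption taken from the published dataset, not from this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice item."""
    prompt: str
    choices: Sequence[str]  # answer options shown to the model
    correct_index: int      # position of the correct option in `choices`


def accuracy(questions: Sequence[Question], predictions: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted index matches the key."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("no questions to score")
    hits = sum(q.correct_index == p for q, p in zip(questions, predictions))
    return hits / len(questions)


def chance_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    if num_choices < 1:
        raise ValueError("num_choices must be positive")
    return 1 / num_choices


# Placeholder items, NOT real benchmark questions.
# Assumption: four options per question.
sample = [
    Question("placeholder 1", ("A", "B", "C", "D"), 0),
    Question("placeholder 2", ("A", "B", "C", "D"), 1),
    Question("placeholder 3", ("A", "B", "C", "D"), 2),
    Question("placeholder 4", ("A", "B", "C", "D"), 0),
]
model_answers = [0, 1, 2, 3]  # last answer is wrong

print(f"accuracy: {accuracy(sample, model_answers):.0%}")
print(f"chance:   {chance_baseline(4):.0%}")
```

With four options, random guessing lands near 25%, which gives a floor for reading the percentages quoted on this page.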
What is a good GPQA Diamond score?
Frontier models score between 80% and 94% on GPQA Diamond. Scores below 70% indicate significant reasoning gaps. A score above 90% places a model in the top tier of scientific reasoning capability. The benchmark has enough headroom that scores will remain meaningful as models continue to improve.
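The thresholds in the paragraph above can be written down as a small lookup. This is only a sketch of the bands as this page states them: `interpret_score` is a hypothetical helper, and the 70-80% band is not characterized in the text, so it is labeled as such rather than given an invented meaning.

```python
def interpret_score(score_pct: float) -> str:
    """Map a GPQA Diamond score (in percent) to the bands described on this page."""
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score must be a percentage between 0 and 100")
    if score_pct > 90.0:
        return "top tier"
    if score_pct >= 80.0:
        return "frontier range"
    if score_pct >= 70.0:
        # Assumption: the page does not describe this band explicitly.
        return "below frontier range"
    return "significant reasoning gaps"


for score in (65.0, 75.0, 85.0, 92.0):
    print(f"{score:.0f}% -> {interpret_score(score)}")
```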
Why GPQA Diamond matters
GPQA Diamond is the best available measure of deep scientific reasoning in AI models. Unlike MMLU-Pro, which is near-saturated at the frontier, GPQA Diamond produces meaningful 10-15 point spreads between top models. It is hard to game through memorization and hard to inflate through prompt engineering, so it gives a comparatively honest signal of improvements in reasoning capability. When a lab claims a reasoning breakthrough, GPQA Diamond is typically the benchmark it cites.
How does GPQA Diamond compare to other benchmarks?
GPQA Diamond is harder than MMLU-Pro and tests deeper reasoning rather than broad knowledge recall. While MMLU-Pro has frontier models clustered within about 7 points (83-90%), GPQA Diamond spreads them across about 14 points (80-94%), roughly twice as wide, making it far more useful for distinguishing between top models. Compared to HLE (Humanity's Last Exam), GPQA Diamond focuses specifically on science, while HLE spans all domains. GPQA Diamond is also more widely reported than HLE, with most frontier model launches including a GPQA score.
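The spread comparison is simple arithmetic on the score ranges quoted above (83-90% for MMLU-Pro, 80-94% for GPQA Diamond). A minimal sketch, where `FRONTIER_RANGES` and `spread` are illustrative names and the numbers are copied from this page rather than from a live leaderboard:

```python
# Frontier score ranges in percent, as quoted in the paragraph above.
FRONTIER_RANGES = {
    "MMLU-Pro": (83, 90),
    "GPQA Diamond": (80, 94),
}


def spread(low: int, high: int) -> int:
    """Width of a score range in percentage points."""
    if high < low:
        raise ValueError("high must be at least low")
    return high - low


spreads = {name: spread(*bounds) for name, bounds in FRONTIER_RANGES.items()}
for name, points in spreads.items():
    print(f"{name}: {points} points")
```

A wider spread means that two top models are more likely to differ by a margin larger than run-to-run noise, which is the sense in which the benchmark separates them better.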
Which AI model has the highest GPQA Diamond score?
Top 10 models by GPQA Diamond
Frequently asked questions