
Humanity's Last Exam

Humanity's Last Exam (HLE) is the hardest standardized AI evaluation available, with approximately 2,500 extremely difficult questions across academic domains. It was created because older benchmarks became too easy for frontier models. Frontier scores range from 20% to 41%, intentionally low to leave headroom for years of improvement.

Key facts

Category: Reasoning
Scoring: Percentage correct (0-100%)
Frontier range: 20-41%
Status (2026): Ceiling test
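
To make the scoring concrete, here is a minimal sketch in Python of percentage-correct scoring. The function name and the graded-answer list are illustrative assumptions, not part of any official HLE harness:

```python
def hle_score(graded: list[bool]) -> float:
    """Percentage correct: share of questions answered correctly, as 0-100%."""
    if not graded:
        raise ValueError("need at least one graded answer")
    return 100.0 * sum(graded) / len(graded)

# Example: 1,025 correct out of 2,500 questions works out to 41.0%.
print(f"{hle_score([True] * 1025 + [False] * 1475):.1f}%")
```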

How Humanity's Last Exam works

HLE consists of expert-contributed questions that require deep specialist knowledge and multi-step reasoning across academic domains. Questions are sourced from domain experts in physics, mathematics, biology, philosophy, and more, then validated to ensure they are beyond the capability of current search engines and require genuine expertise to answer correctly.
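
As a rough illustration of that pipeline, the sketch below models an expert-contributed question and the validation gate described above. All field and function names are assumptions for illustration, not HLE's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CandidateQuestion:
    # Hypothetical record for an expert submission (illustrative only).
    domain: str              # e.g. "physics", "mathematics", "philosophy"
    text: str                # the question itself
    answer: str              # the expert's reference answer
    search_answerable: bool  # can a current search engine surface the answer?
    expert_verified: bool    # confirmed correct by an independent expert

def passes_validation(q: CandidateQuestion) -> bool:
    """Keep only questions that require genuine expertise: verified by an
    expert and not answerable via current search engines."""
    return q.expert_verified and not q.search_answerable

submissions = [
    CandidateQuestion("physics", "…", "…", search_answerable=True, expert_verified=True),
    CandidateQuestion("philosophy", "…", "…", search_answerable=False, expert_verified=True),
]
accepted = [q for q in submissions if passes_validation(q)]
print(len(accepted))  # 1
```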

What is a good Humanity's Last Exam score?

Frontier models score between 20% and 41% on HLE. The current best score is 41% (Gemini 3.1 Pro), and most frontier models fall between 24% and 34%. These low scores are by design: the benchmark was built to be extremely difficult. A score above 35% currently places a model in the top tier.
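
Read as a rule of thumb, those thresholds can be expressed directly. The cutoffs below come from the figures in this section; the tier labels themselves are our own shorthand:

```python
def hle_tier(score_pct: float) -> str:
    # Cutoffs from the text: frontier models score 20-41%,
    # and a score above 35% is currently top tier.
    if score_pct > 35:
        return "top tier"
    if score_pct >= 20:
        return "frontier range"
    return "below current frontier"

for s in (41.0, 27.5, 12.0):
    print(s, "->", hle_tier(s))
```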

Why Humanity's Last Exam matters

HLE is the frontier ceiling test. When a model improves on HLE, it represents a genuine capability advance, not benchmark gaming or memorization. The intentionally low score range (20-41%) means the benchmark will remain informative as models improve over the next several years. It is the best answer to the question: how close are we to AI systems that can match human experts across all domains?

How does Humanity's Last Exam compare to other benchmarks?

HLE is much harder than GPQA Diamond (which focuses on science) and spans all academic domains. While frontier models score 80-94% on GPQA Diamond, they score only 20-41% on HLE. HLE has the most headroom of any current benchmark; it will remain unsaturated and meaningful far longer than MMLU-Pro, HumanEval, or MATH.
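
One way to see that headroom gap is to compute distance-to-ceiling from the frontier-best scores quoted in this section. The dictionary below uses only those figures; scores for other benchmarks will differ:

```python
# Frontier-best scores quoted in this section; headroom is distance to 100%.
frontier_best = {"HLE": 41, "GPQA Diamond": 94, "MMLU-Pro": 90}

for bench, best in sorted(frontier_best.items(), key=lambda kv: kv[1]):
    print(f"{bench}: best {best}%, headroom {100 - best} points")
# HLE: best 41%, headroom 59 points
# MMLU-Pro: best 90%, headroom 10 points
# GPQA Diamond: best 94%, headroom 6 points
```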

Which AI model has the highest Humanity's Last Exam score?

Top 10 models by Humanity's Last Exam

Frequently asked questions

What is Humanity's Last Exam?
Humanity's Last Exam (HLE) is an extremely difficult AI benchmark with ~2,500 questions across academic domains. It was created because older benchmarks became too easy for frontier models. Scores range from 20-41%, providing headroom for years of model improvement.

What is a good Humanity's Last Exam score?
Frontier models score 20-41%. The current leader is Gemini 3.1 Pro at 41%. Scores above 35% are top-tier. The low range is intentional: HLE is designed to remain challenging as models improve.

Why are HLE scores so low?
HLE questions are intentionally extremely difficult, requiring deep specialist knowledge that even human domain experts find challenging. The low scores (20-41%) ensure the benchmark remains informative and unsaturated for years, unlike MMLU-Pro, which is already near-saturated at 83-90%.

Which AI model has the highest HLE score?
As of April 2026, Gemini 3.1 Pro leads at 41.0%, followed by Gemini 3 Flash at 33.7% and Grok 4 at 24.0%. Very few models have published HLE scores.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
