Humanity's Last Exam
Humanity's Last Exam (HLE) is the hardest standardized AI evaluation available, with approximately 2,500 extremely difficult questions across academic domains. It was created because older benchmarks became too easy for frontier models. Frontier models currently score between 20% and 41% on HLE, a range that is intentionally low to leave headroom for years of improvement.
Key facts
How Humanity's Last Exam works
HLE consists of expert-contributed questions that require deep specialist knowledge and multi-step reasoning across academic domains. Questions are sourced from domain experts in physics, mathematics, biology, philosophy, and more, then validated to ensure they are beyond the capability of current search engines and require genuine expertise to answer correctly.
What is a good Humanity's Last Exam score?
Frontier models score between 20% and 41% on HLE. The current best score is 41% (Gemini 3.1 Pro), and most frontier models score between 24% and 34%. These low scores are by design: the benchmark is built to be extremely difficult. A score above 35% currently places a model in the top tier.
Why Humanity's Last Exam matters
HLE is the frontier ceiling test. When a model improves on HLE, it represents a genuine capability advance, not benchmark gaming or memorization. The intentionally low score range (20-41%) means the benchmark will remain informative as models improve over the next several years. It is the best answer to the question: how close are we to AI systems that can match human experts across all domains?
How does Humanity's Last Exam compare to other benchmarks?
HLE is much harder than GPQA Diamond, which focuses on science, and it spans all academic domains. While frontier models score 80-94% on GPQA Diamond, they score only 20-41% on HLE. HLE has the most headroom of any current benchmark, so it should remain unsaturated and meaningful far longer than MMLU-Pro, HumanEval, or MATH.
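To make "headroom" concrete, here is a minimal sketch that treats headroom as the gap between the best reported score and a perfect 100%. The top scores are the upper ends of the ranges quoted above; the `headroom` function and the `top_scores` dictionary are illustrative names, not part of any official HLE tooling.

```python
# Best reported scores (percent), taken from the ranges quoted in the text.
top_scores = {
    "HLE": 41.0,
    "GPQA Diamond": 94.0,
}


def headroom(top_score_pct: float) -> float:
    """Percentage points left between the best score and a perfect 100%."""
    return 100.0 - top_score_pct


for name, score in top_scores.items():
    print(f"{name}: top score {score:.0f}%, headroom {headroom(score):.0f} points")
```

Under this simple definition, HLE leaves 59 points of room for improvement against 6 for GPQA Diamond, which is why gains on HLE are easier to distinguish from noise near the ceiling.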
Which AI model has the highest Humanity's Last Exam score?
Top 10 models by Humanity's Last Exam
Frequently asked questions
See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.