
HumanEval

HumanEval is a Python code generation benchmark with 164 programming problems. Originally the standard coding benchmark, it is now near-saturated with most frontier models scoring above 90%, making it less useful for comparing top models but still relevant as a minimum capability bar.

Key facts

Category: Coding
Scoring: pass@1, reported as a percentage (0-100%)
Frontier range: 72-97%
Status (2026): Near-saturated
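
For context on the scoring row above: the original HumanEval paper (Chen et al., 2021) defines an unbiased pass@k estimator, of which the reported pass@1 is the k=1 case. A minimal sketch of that estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that passed every test case
    k: the k in pass@k (k=1 gives the pass@1 reported here)
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# The benchmark score is the mean over all 164 problems, e.g.:
# score = 100 * mean(pass_at_k(n_i, c_i, 1) for each problem i)
```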

How HumanEval works

HumanEval presents 164 Python programming problems, each with a function signature and docstring. The model must generate the complete function body. Success is measured by pass@1: whether the generated code passes all test cases on the first attempt. Problems range from simple string manipulation to moderate algorithmic challenges.
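
To make the format concrete, here is a hypothetical HumanEval-style problem (not an actual benchmark item) and the pass@1 check applied to a model's completion. The real harness runs completions in a sandbox with timeouts; this sketch only illustrates the prompt/completion/test flow:

```python
# Hypothetical HumanEval-style prompt: the model sees the signature and
# docstring and must generate the function body.
PROMPT = '''
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards."""
'''

# A model-generated body, appended to the prompt before execution.
COMPLETION = "    return text == text[::-1]\n"

# Each problem ships with hidden unit tests; pass@1 means the first
# sampled completion must pass all of them.
TESTS = """
assert is_palindrome("level") is True
assert is_palindrome("hello") is False
assert is_palindrome("") is True
"""

namespace: dict = {}
exec(PROMPT + COMPLETION, namespace)   # define the completed function
exec(TESTS, namespace)                 # raises AssertionError on any failure
print("pass@1: solved")                # reached only if every test passed
```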

What is a good HumanEval score?

Most frontier models score above 90% on HumanEval, making small score differences (1-2 points) largely meaningless. A score below 85% indicates a model with limited coding ability. The benchmark is useful primarily as a floor test rather than a ceiling discriminator.
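
One way to see why 1-2 point gaps are noise: with only 164 problems, a single problem flipping moves the score by about 0.6 points, and sampling error at ~90% accuracy is of similar magnitude. A quick back-of-the-envelope check:

```python
import math

N = 164                      # number of HumanEval problems
resolution = 100 / N         # score change from one problem flipping
print(f"per-problem resolution: {resolution:.2f} points")  # ~0.61

# Normal-approximation standard error of a 92% pass@1 score:
p = 0.92
se = 100 * math.sqrt(p * (1 - p) / N)
print(f"standard error at 92%: {se:.2f} points")  # ~2.1 points

# A 1-2 point gap between two ~92% models sits within one standard
# error, so it does not reliably separate them.
```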

Why HumanEval matters

HumanEval has historical significance as the original standardized coding benchmark. While near-saturated at the frontier (most top models score 90%+), it remains useful as a minimum capability bar. If a model scores below 85% on HumanEval, it likely has fundamental code generation limitations. For comparing frontier models, LiveCodeBench and SWE-bench are more informative.

How does HumanEval compare to other benchmarks?

HumanEval tests isolated function generation, the simplest form of code generation. LiveCodeBench tests harder competitive programming problems and is contamination-resistant. SWE-bench tests real-world software engineering with full codebases. HumanEval is the easiest of the three and provides the least discrimination at the frontier.

Which AI model has the highest HumanEval score?

As of April 2026, o3 and Grok 4.1 Fast share the top score at 97.0%, with Claude Opus 4.6 close behind at 96.0%.

Top 10 models by HumanEval

Frequently asked questions

What is HumanEval?

HumanEval is a Python code generation benchmark with 164 programming problems. Models generate function bodies from docstrings and are evaluated on whether the code passes test cases on the first attempt (pass@1).

What is a good HumanEval score?

Most frontier models score above 90%. The benchmark is near-saturated, meaning scores above 90% do not meaningfully distinguish between models. Below 85% indicates limited coding capability.

Is HumanEval still relevant?

HumanEval is near-saturated and less useful for comparing frontier models. LiveCodeBench and SWE-bench provide better discrimination. HumanEval remains useful as a minimum capability bar and for continuity with historical reporting.

Which model currently scores highest?

As of April 2026, multiple models score 95%+, including o3 at 97.0%, Grok 4.1 Fast at 97.0%, and Claude Opus 4.6 at 96.0%. The narrow spread at the top makes HumanEval less informative than other coding benchmarks.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
