HumanEval
HumanEval is a Python code-generation benchmark of 164 hand-written programming problems, introduced by OpenAI in 2021 alongside the Codex model. Once the standard coding benchmark, it is now near-saturated: most frontier models score above 90%, which makes it less useful for separating top models but still relevant as a minimum capability bar.
How HumanEval works
HumanEval presents 164 Python programming problems, each consisting of a function signature and a docstring. The model must generate the complete function body, and success is measured by pass@1: whether the generated code passes all unit tests on the first attempt. Problems range from simple string manipulation to moderate algorithmic challenges.
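To make the format concrete, here is a sketch of a HumanEval-style task and its evaluation, along with the unbiased pass@k estimator from the HumanEval paper. The task text, test values, and function names are illustrative reconstructions, not the benchmark's exact contents.

```python
from itertools import combinations
from math import comb

# A HumanEval-style task: the prompt supplies the signature and
# docstring; the model must write the body. (Illustrative example,
# modeled on the benchmark's format.)
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to
    each other than the given threshold."""
    return any(abs(a - b) < threshold for a, b in combinations(numbers, 2))

# Hidden unit tests: a sampled completion "passes" only if every
# assertion holds when run against it.
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) is True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) is False
    assert candidate([1.0, 2.0, 3.0], 0.5) is False

check(has_close_elements)  # no exception -> this sample passes

# Unbiased pass@k estimator: draw n samples, observe c correct ones.
# pass@k = 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Reported pass@1 scores are the mean of this quantity over all 164 problems.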
What is a good HumanEval score?
Most frontier models score above 90% on HumanEval, making small score differences (1-2 points) largely meaningless. A score below 85% indicates a model with limited coding ability. The benchmark is useful primarily as a floor test rather than a ceiling discriminator.
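A back-of-envelope calculation shows why 1-2 point gaps are noise: with only 164 problems, each problem is worth about 0.6 points, and the binomial standard error around a 90% score is comparable to the gaps between frontier models. The numbers below are illustrative, not tied to any specific model.

```python
from math import sqrt

n = 164   # number of HumanEval problems
p = 0.90  # a typical frontier-model pass@1

one_problem = 100 / n              # points gained per extra problem solved
se = 100 * sqrt(p * (1 - p) / n)   # binomial standard error, in points

print(round(one_problem, 2))  # ≈ 0.61 points per problem
print(round(se, 2))           # ≈ 2.34 points of sampling noise
```

A 1-2 point lead therefore sits well within one standard error of sampling variation.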
Why HumanEval matters
HumanEval has historical significance as the original standardized coding benchmark. While near-saturated at the frontier (most top models score 90%+), it remains useful as a minimum capability bar. If a model scores below 85% on HumanEval, it likely has fundamental code generation limitations. For comparing frontier models, LiveCodeBench and SWE-bench are more informative.
How does HumanEval compare to other benchmarks?
HumanEval tests isolated function generation, the simplest form of code generation. LiveCodeBench tests harder competitive-programming problems and resists contamination by continuously adding recently published problems. SWE-bench tests real-world software engineering against full codebases. HumanEval is the easiest of the three and provides the least discrimination at the frontier.