
LiveCodeBench

LiveCodeBench is a coding benchmark using competitive programming problems sourced from recent competitions, making it resistant to data contamination. It is harder and more reliable than HumanEval and is increasingly used as the primary coding benchmark by evaluation platforms.

Key facts

Category: Coding
Scoring: pass@1, reported as a percentage (0-100%); a sketch of the standard estimator follows this list
Frontier range: 17-91%
Status (2026): Emerging standard
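
LiveCodeBench reports pass@1, and the de facto way to compute it (popularized by the HumanEval paper, Chen et al. 2021) is the unbiased pass@k estimator: sample n generations per problem, count the c that pass all tests, and estimate the chance that k random draws contain at least one pass. Below is a minimal sketch in Python; the estimator itself is standard, but exactly how LiveCodeBench's own harness samples and aggregates is an assumption here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn from n generations (c correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# For k=1 the estimator reduces to the fraction of passing samples, c/n.
print(pass_at_k(n=10, c=4, k=1))  # 0.4
```

For pass@1, this is simply the fraction of attempts that pass, which is why scores read directly as percentages.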

How LiveCodeBench works

LiveCodeBench presents competitive programming problems drawn from recent coding competitions that postdate model training cutoffs, so models cannot have memorized the solutions. Models must solve each problem and pass all of its test cases on the first attempt (pass@1). Problems range from straightforward algorithmic tasks to complex multi-step reasoning challenges.
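
To make the pass@1 protocol concrete, here is a minimal, hypothetical grading loop: it runs a candidate program against stdin/stdout test cases and grants credit only if every case matches exactly. The file name, test data, and time limit are invented for illustration; the actual LiveCodeBench harness adds sandboxing, per-problem limits, and its own problem format.

```python
import subprocess

# Hypothetical test cases for illustration; real problems
# ship many hidden tests with strict time limits.
tests = [
    {"stdin": "3\n1 2 3\n", "expected": "6\n"},
    {"stdin": "2\n5 7\n", "expected": "12\n"},
]

def passes_all_tests(solution_path, tests, timeout_s=5.0):
    """pass@1 credit requires the single attempt to produce the
    exact expected output on every test case."""
    for case in tests:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=case["stdin"],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # time limit exceeded fails the whole problem
        if result.returncode != 0 or result.stdout != case["expected"]:
            return False  # runtime error or wrong answer
    return True

# "sum_numbers.py" is an assumed candidate file, not a real artifact.
print(passes_all_tests("sum_numbers.py", tests))
```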

What is a good LiveCodeBench score?

Scores span an unusually wide range, from 17% to 91%. Frontier reasoning models score 80-91%, general frontier models score 45-80%, and smaller models score below 40%. This wide spread makes LiveCodeBench useful for comparing models at every tier, unlike HumanEval, which only distinguishes models at the lower end.

Why LiveCodeBench matters

LiveCodeBench addresses the main weakness of HumanEval: data contamination. Since problems come from recent competitions held after model training cutoffs, models cannot have memorized solutions. The very wide score range (17-91%) provides strong discrimination across all model tiers, from small open-weight models to frontier reasoning systems. It is increasingly replacing HumanEval as the primary coding benchmark.

How does LiveCodeBench compare to other benchmarks?

LiveCodeBench is harder and more contamination-resistant than HumanEval. While most frontier models score 90%+ on HumanEval (making it nearly useless for comparison), LiveCodeBench spreads them from 17% to 91%. Compared to SWE-bench, LiveCodeBench tests algorithmic problem-solving rather than real-world software engineering. Both are valuable but measure different skills.

Which AI model has the highest LiveCodeBench score?

As of April 2026, Gemini 3 Flash leads the Top 10 models by LiveCodeBench at 90.8%, followed by o4-mini at 85.9% and Kimi K2.5 at 85.0%.

Frequently asked questions

What is LiveCodeBench?
LiveCodeBench is a coding benchmark built on competitive programming problems from recent competitions. Because the problems postdate model training, scores cannot be inflated through memorization. It is increasingly replacing HumanEval as the primary coding benchmark.

What is a good LiveCodeBench score?
Frontier reasoning models score 80-91%, general frontier models score 45-80%, and below 40% indicates limited coding ability. The wide range makes the benchmark useful for comparing models at every level.

How does LiveCodeBench differ from HumanEval?
HumanEval uses 164 fixed Python problems that most frontier models have likely seen in training, which is why most score 90%+. LiveCodeBench uses problems from recent competitions that postdate training, making it contamination-resistant with a much wider score spread (17-91%).

Which AI model has the highest LiveCodeBench score?
As of April 2026, Gemini 3 Flash leads at 90.8%, followed by o4-mini at 85.9% and Kimi K2.5 at 85.0%. Reasoning models with thinking modes score significantly higher.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
