
SWE-bench Verified

SWE-bench Verified measures AI models on their ability to resolve real GitHub issues from popular open-source Python repositories. It is the gold standard for evaluating coding agent capability and the most production-relevant benchmark for software engineering teams.

Key facts

Category: Coding
Scoring: Percentage of issues resolved (0-100%)
Frontier range: 54-81%
Status (2026): Production-relevant

How SWE-bench Verified works

Each SWE-bench task is a real GitHub issue with a known human-written fix. The model receives the issue description and the full repository code, then must produce a diff (code patch) that resolves the issue. Success is measured by whether the repository's existing test suite passes after the model's patch is applied. The Verified subset uses issues that have been manually validated by human reviewers to ensure the task is solvable and the test is fair.

What is a good SWE-bench Verified score?

Frontier models score between 54% and 81% on SWE-bench Verified. A score of 80%+ means the model can resolve 4 out of 5 real GitHub issues autonomously. Scores below 65% indicate the model handles simpler issues but struggles with complex multi-file changes. The benchmark is far from saturated, making it meaningful for years to come.

Why SWE-bench Verified matters

SWE-bench Verified is the closest benchmark to actual software engineering work. Unlike HumanEval (which tests isolated function generation from docstrings), SWE-bench requires understanding large real-world codebases, reading issue reports, debugging, and producing patches that pass existing tests. For enterprise teams evaluating whether an AI coding agent can handle real development tasks, SWE-bench is the most informative benchmark available.

How does SWE-bench Verified compare to other benchmarks?

SWE-bench Verified tests end-to-end software engineering, while HumanEval tests isolated code generation and LiveCodeBench tests competitive programming. SWE-bench is harder and more realistic than both - models score 54-81% on SWE-bench compared to 82-97% on HumanEval. The gap between these scores reveals whether a model can handle real-world complexity beyond toy problems.

Which AI model has the highest SWE-bench Verified score?

Top 10 models by SWE-bench Verified

Frequently asked questions

What is SWE-bench Verified?
SWE-bench Verified is a coding benchmark that tests AI models on resolving real GitHub issues from popular open-source Python repositories. The model must read the issue, understand the codebase, and generate a working patch. It is the gold standard for evaluating AI coding agents.

What is a good SWE-bench Verified score?
Frontier models score between 54% and 81%. A score above 75% indicates strong autonomous coding ability. Claude Opus 4.6 and Gemini 3.1 Pro lead at approximately 80-81%.

How does SWE-bench Verified differ from HumanEval?
HumanEval tests isolated function generation from docstrings (most models score 90%+). SWE-bench tests end-to-end software engineering with real codebases, real issues, and real test suites. SWE-bench is much harder, more realistic, and more relevant for evaluating coding agents.

Which AI model has the highest SWE-bench Verified score?
As of April 2026, Claude Opus 4.6 leads at 80.8%, closely followed by Gemini 3.1 Pro at 80.6%, with Claude Opus 4.5 also in the 80-81% range. Scores come from provider disclosures and independent evaluations.

See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.
