SWE-bench Verified
SWE-bench Verified measures AI models on their ability to resolve real GitHub issues from popular open-source Python repositories. It is the gold standard for evaluating coding agent capability and the most production-relevant benchmark for software engineering teams.
Key facts
How SWE-bench Verified works
Each SWE-bench task is a real GitHub issue paired with a known human-written fix. The model receives the issue description and the full repository at a pinned base commit, then must produce a diff (code patch) that resolves the issue. Success is judged by the repository's own test suite: the tests that reproduce the bug must pass after the model's patch is applied, and previously passing tests must not regress. The Verified subset contains only issues that human reviewers have manually screened to confirm the task is solvable and the tests are fair.
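The pass/fail criterion above can be sketched in a few lines. This is a simplified illustration, not the official harness (which runs each repository's tests in isolated containers); the FAIL_TO_PASS / PASS_TO_PASS terminology follows the published dataset schema, and the function and test names here are hypothetical.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """An instance counts as resolved only if every test that reproduced
    the bug now passes (FAIL_TO_PASS) and no previously passing test
    regressed (PASS_TO_PASS). A missing result is treated as a failure."""
    return (all(test_results.get(t, False) for t in fail_to_pass)
            and all(test_results.get(t, False) for t in pass_to_pass))

# Hypothetical run: the patch fixes the bug but breaks an existing test,
# so the instance is NOT resolved.
results = {"test_bug_fixed": True, "test_existing_ok": False}
print(is_resolved(results, ["test_bug_fixed"], ["test_existing_ok"]))  # False
```

This all-or-nothing scoring is why SWE-bench numbers are lower than single-function benchmarks: a patch that fixes the issue but regresses anything else earns no credit.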
What is a good SWE-bench Verified score?
Frontier models score between 54% and 81% on SWE-bench Verified. A score of 80%+ means the model resolves roughly 4 out of 5 real GitHub issues autonomously. Scores below 65% indicate a model that handles simpler issues but struggles with complex multi-file changes. The benchmark is far from saturated, so it retains headroom to separate frontier models rather than compressing them at the ceiling.
Why SWE-bench Verified matters
SWE-bench Verified is the closest benchmark to actual software engineering work. Unlike HumanEval (which tests isolated function generation from docstrings), SWE-bench requires understanding large real-world codebases, reading issue reports, debugging, and producing patches that pass existing tests. For enterprise teams evaluating whether an AI coding agent can handle real development tasks, SWE-bench is the most informative benchmark available.
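The difference in task format described above can be made concrete with two record sketches. Field names loosely follow the public SWE-bench dataset schema (repo, base commit, problem statement, test lists) but should be treated as illustrative; the example values are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class HumanEvalTask:
    # Input: a function signature plus docstring.
    # Expected output: one self-contained function body.
    prompt: str

@dataclass
class SWEBenchTask:
    # Input: an entire repository at a pinned commit plus an issue report.
    # Expected output: a unified diff against the whole repository.
    repo: str                 # e.g. "django/django"
    base_commit: str          # commit the patch must apply cleanly to
    problem_statement: str    # the GitHub issue text
    fail_to_pass: list[str] = field(default_factory=list)  # tests the patch must make pass
    pass_to_pass: list[str] = field(default_factory=list)  # tests the patch must not break

# Hypothetical instance, for illustration only.
task = SWEBenchTask(
    repo="django/django",
    base_commit="deadbeef",
    problem_statement="QuerySet raises an unexpected error when ...",
    fail_to_pass=["tests/queries/test_bug.py::test_fixed"],
    pass_to_pass=["tests/queries/test_existing.py::test_ok"],
)
print(task.repo)  # django/django
```

The contrast shows why the two benchmarks measure different skills: HumanEval scopes the problem to a single prompt, while a SWE-bench instance forces the model to locate the relevant code itself before it can write a patch.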
How does SWE-bench Verified compare to other benchmarks?
SWE-bench Verified tests end-to-end software engineering, while HumanEval tests isolated code generation and LiveCodeBench tests competitive programming. SWE-bench is harder and more realistic than both: models score 54-81% on SWE-bench Verified versus 82-97% on HumanEval. The gap between these scores reveals whether a model can handle real-world complexity beyond toy problems.
Which AI model has the highest SWE-bench Verified score?
Top 10 models by SWE-bench Verified score
Frequently asked questions
See all benchmark scores in the AI Frontier Model Tracker. Compare across all 8 benchmarks.