SWE-bench repository
Official benchmark code, data, and evaluation documentation.
Compute College
Learn what SWE-bench measures, why it matters for AI coding agents, and how software-engineering benchmarks connect to AI compute demand.
One concept connected to AI compute market decisions.
A practical introduction designed to be completed in one sitting.
Useful for developers, founders, procurement teams, and analysts tracking model-serving economics.
Plain-English definition
SWE-bench is a software-engineering benchmark that evaluates whether AI systems can resolve real GitHub issues by producing changes to real repositories that satisfy evaluation tests.
Why it matters
Repository-level repair is closer to deployed coding-agent work than short completions. If such workflows become reliable, each engineering task can drive longer, repeated inference runs for debugging, patching, and validation.
Simple example
A task can give an agent a code repository plus an issue description, then evaluate whether the submitted patch resolves the problem under its tests. Tool access and agent scaffold affect both score and cost.
Example figures are illustrative calculations, not current quoted market prices.
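The pass/fail idea in this example can be sketched in a few lines. This is a minimal illustration, not the official harness: the `TaskResult` record and its field names are hypothetical, and it assumes a task counts as resolved only when the tests tied to the issue pass and the previously passing tests still pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of test outcomes after a patch is applied."""
    target_tests: dict[str, bool]      # tests tied to the issue: name -> passed
    regression_tests: dict[str, bool]  # tests that passed before the patch: name -> passed


def is_resolved(result: TaskResult) -> bool:
    """Assumed rule: every issue test and every regression test must pass."""
    if not result.target_tests:
        return False  # nothing demonstrates that the issue was fixed
    return all(result.target_tests.values()) and all(result.regression_tests.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved under the assumed rule above."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_resolved(r) for r in results) / len(results)
```

A patch that fixes the issue but breaks an existing test is scored as unresolved here, which is why a headline percentage says nothing about partially correct patches.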
Current example
The official SWE-bench repository describes a benchmark for resolving real-world GitHub issues and documents its benchmark variants, including SWE-bench Verified. Last checked: May 24, 2026.
No leaderboard performance claim is made here; consult the official benchmark configuration before comparing systems.
Market signal
Read SWE-bench gains as a possible coding-agent demand signal only when evaluation configuration is comparable and the capability is adopted for real engineering work.
Market read: this metric becomes an AI compute signal only when it changes serving volume, effective workload cost, or the capacity buyers require.
Common mistake
Do not compare SWE-bench values without checking the subset, scaffold, tools, test-time compute, and evaluation date.
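One way to make this check routine is to record the configuration next to every score and compare the records before comparing the numbers. The sketch below is illustrative only: `EvalConfig` and its string fields are hypothetical names for the five items listed above, not a format defined by SWE-bench.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings behind one reported score."""
    subset: str
    scaffold: str
    tools: str
    test_time_compute: str
    evaluation_date: str


def config_differences(a: EvalConfig, b: EvalConfig) -> list[str]:
    """Return the names of the fields on which two configurations differ."""
    return [f.name for f in fields(EvalConfig) if getattr(a, f.name) != getattr(b, f.name)]
```

An empty list means the two scores were produced under the same recorded settings. Any non-empty list names what has to be reconciled before the scores are placed side by side.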
Practical takeaway
Use SWE-bench as capability evidence, then measure your own repository tasks by cost per accepted patch and engineer review burden.
Decision check: identify the capability measured, the serving cost driver it affects, and the buyer behavior that would make capacity demand change.
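Cost per accepted patch can be computed from a small internal pilot. The sketch below assumes a hypothetical `PilotRun` record and treats engineer review time as a cost alongside inference spend. All names and figures are illustrative, not quoted prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotRun:
    """Hypothetical totals from running an agent on your own repository tasks."""
    attempts: int                      # tasks the agent attempted
    accepted_patches: int              # patches engineers merged after review
    inference_cost_usd: float          # total model-serving spend
    review_minutes: float              # total engineer time spent reviewing
    reviewer_rate_usd_per_hour: float  # assumed loaded hourly rate


def acceptance_rate(run: PilotRun) -> float:
    """Share of attempted tasks that ended in an accepted patch."""
    if run.attempts <= 0:
        raise ValueError("attempts must be positive")
    return run.accepted_patches / run.attempts


def cost_per_accepted_patch(run: PilotRun) -> float:
    """Inference spend plus review cost, divided by accepted patches."""
    if run.accepted_patches <= 0:
        raise ValueError("no accepted patches; cost per accepted patch is undefined")
    review_cost = run.review_minutes / 60 * run.reviewer_rate_usd_per_hour
    return (run.inference_cost_usd + review_cost) / run.accepted_patches


# Illustrative figures only.
pilot = PilotRun(attempts=100, accepted_patches=25, inference_cost_usd=200.0,
                 review_minutes=600, reviewer_rate_usd_per_hour=90.0)
print(acceptance_rate(pilot))          # 0.25
print(cost_per_accepted_patch(pilot))  # 44.0
```

In this made-up pilot, review time costs 900 USD against 200 USD of inference, so the review burden dominates the cost of each accepted patch.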
Helpful memory trick
SWE-bench is closer to “fix this repo issue” than “write this function.”