SWE-bench is a museum — a curated collection of historical bugs. needle-bench is a marketplace. Developers package their worst production failures as self-contained Docker images. An agent is dropped in with no context beyond "find the needle." Scoring is deterministic: same run, same scores. The benchmarks are real, the bugs are anonymized, and the leaderboard is public.
## How it works
Each benchmark is a directory with four files:
```
benchmarks/off-by-one-pagination/
  Dockerfile         # builds the broken environment
  Agentfile          # agent constraints (tools, turn limit, token budget)
  test.sh            # exit 0 = fixed, exit 1 = still broken
  .bench/
    solution.patch   # sealed ground truth — agent never sees this
```

The Dockerfile contains a real bug from a real codebase. Not synthetic. Not contrived. The agent gets shell access and a failing test. That's it.
A real Agentfile looks like this:
```
FROM off-by-one-pagination
TOOL sh_run
TOOL ss
LIMIT turns 20
LIMIT tokens 100000
LIMIT wall_clock 300
```

## Scoring: 11 metrics, no opinions
Every run produces the same 11 numbers. No subjective evaluation. No LLM-as-judge. The test passes or it doesn't. Everything else is measured, not guessed.
- `resolved`: did test.sh pass? boolean. the only metric that matters for the headline.
- `turns_to_discovery`: turns before the agent identifies the correct file and region.
- `turns_to_fix`: turns before the agent produces a passing patch.
- `signal_to_noise`: ratio of productive actions to total actions. 0.0–1.0.
- `false_positives`: files edited that aren't in the solution. fewer is better.
- `token_cost`: total tokens consumed (input + output). the bill.
- `tokens_per_correct_line`: efficiency — how many tokens per line of correct fix.
- `recovery_events`: times the agent went wrong, noticed, and self-corrected.
- `recovery_rate`: of those recovery events, how many led back to the fix.
- `wall_clock`: wall-clock seconds from start to score.
- `blind_discovery`: did the agent find the bug with no hints beyond test output?
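Two of the ratio metrics can be reproduced by hand from a run's raw counts. A sketch (the action-log format and the numbers here are invented for illustration; the real harness records these during the run):

```shell
# Toy action log: one line per agent action, tagged productive or noise.
# This format is an assumption, not the harness's real log format.
cat > actions.log <<'EOF'
productive
noise
productive
productive
EOF

# signal_to_noise = productive actions / total actions
awk '/^productive/ {p++} {t++} END {printf "signal_to_noise %.2f\n", p/t}' actions.log
# → signal_to_noise 0.75

# tokens_per_correct_line = token_cost / lines in the correct fix:
# a one-line fix at 32000 tokens costs 32000 tokens per correct line.
echo "tokens_per_correct_line $((32000 / 1))"
# → tokens_per_correct_line 32000
```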
Leaderboard ranking: `resolved` (desc) → `turns_to_fix` (asc) → `token_cost` (asc) → `wall_clock` (asc).
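The ranking rule can be applied with plain `sort`. A sketch over an assumed whitespace-separated export of agent, resolved, turns_to_fix, token_cost, and wall_clock (the real leaderboard presumably works off scores.json directly):

```shell
# Toy export: agent, resolved, turns_to_fix, token_cost, wall_clock.
cat > runs.txt <<'EOF'
agent-a 1 6 50000 120.0
agent-b 1 4 32000 45.2
agent-c 0 9 80000 300.0
EOF

# -k2,2nr sorts resolved descending; the remaining keys break ties ascending.
sort -k2,2nr -k3,3n -k4,4n -k5,5n runs.txt
# agent-b ranks first: resolved, with the fewest turns_to_fix.
```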
## Current benchmarks
| Name | Category | Difficulty |
|---|---|---|
| off-by-one-pagination | logic error | easy |
| race-condition-counter | concurrency | medium |
| silent-data-corruption | data integrity | medium |
22 more benchmarks planned across easy, medium, and hard tiers. See roadmap.
## Get started
Clone, verify, run. Docker required.
```sh
git clone https://github.com/os-tack/find-the-needle
cd needle-bench
make verify   # confirms all benchmarks have valid solutions
make bench    # runs all benchmarks (Docker required)
```

Run a single benchmark:

```sh
make run BENCH=off-by-one-pagination
```

Validate your own benchmark before submitting:

```sh
make validate BENCH=my-new-benchmark
```

## What the data looks like
Every run appends a record to scores.json. No interpretation layer. Raw numbers.
```json
{
  "benchmark": "off-by-one-pagination",
  "agent": "claude-opus-4-6",
  "resolved": true,
  "turns_to_discovery": 1,
  "turns_to_fix": 4,
  "signal_to_noise": 0.85,
  "false_positives": 0,
  "token_cost": 32000,
  "tokens_per_correct_line": 32000,
  "recovery_events": 0,
  "recovery_rate": 1.0,
  "wall_clock": 45.2,
  "blind_discovery": true
}
```

## Contribute a benchmark
Got a gnarly bug? Package it. The best benchmarks come from the worst debugging days.
- A real bug from a real codebase — not synthetic, not contrived
- A Dockerfile that builds in under 5 minutes
- A `test.sh` that fails on the bug and passes on the fix
- A sealed `.bench/solution.patch` the agent never sees
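The fail-before, pass-after contract behind that checklist can be sketched end to end with toy stand-ins (every file name here is invented, and the one-line rewrite stands in for applying `.bench/solution.patch`):

```shell
# Toy validation: test.sh must fail on the broken tree
# and pass once the fix is applied. All paths are invented.
mkdir -p bench
printf 'broken\n' > bench/code.txt
printf '%s\n' '#!/bin/sh' 'grep -q fixed bench/code.txt' > bench/test.sh

sh bench/test.sh && echo "BAD: passed before the fix" \
                 || echo "ok: fails before the fix"

printf 'fixed\n' > bench/code.txt   # stand-in for applying solution.patch

sh bench/test.sh && echo "ok: passes after the fix" \
                 || echo "BAD: still failing after the fix"
```

A submission that passes before the patch, or fails after it, has no needle to find and should be rejected by validation.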