Your worst debugging day, everyone's benchmark.

Developers submit real bugs as frozen Docker snapshots. AI agents get one prompt: "find the needle." 11 metrics. Deterministic scoring. No hiding.

SWE-bench is a museum: a curated collection of historical bugs. needle-bench is a marketplace: developers package their worst production failures as self-contained Docker images. An agent is dropped in with no context beyond "find the needle." Scoring is deterministic: the same inputs always produce the same scores. The benchmarks are real, the bugs are anonymized, and the leaderboard is public.

3 benchmarks
11 metrics
1 model tested
<5m build time

How it works

Each benchmark is a directory with four files:

benchmarks/off-by-one-pagination/
  Dockerfile        # builds the broken environment
  Agentfile         # agent constraints (tools, turn limit, token budget)
  test.sh           # exit 0 = fixed, exit 1 = still broken
  .bench/
    solution.patch  # sealed ground truth — agent never sees this

The Dockerfile contains a real bug from a real codebase. Not synthetic. Not contrived. The agent gets shell access and a failing test. That's it.
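The whole contract of test.sh is its exit status. A minimal sketch, assuming a hypothetical pagination check (real benchmarks ship their own scripts):

```shell
#!/bin/sh
# Sketch of a test.sh: needle-bench reads only the exit status.
# Hypothetical check: paginating 10 items with page size 3 must give
# 4 pages; the off-by-one bug under test reports 3.
actual_pages=$(( (10 + 3 - 1) / 3 ))  # ceil(10/3); stands in for running the paginator
if [ "$actual_pages" -eq 4 ]; then
    status=0   # fixed
else
    status=1   # still broken
fi
echo "would exit with status $status"
# a real test.sh would end with: exit "$status"
```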

A real Agentfile looks like this:

FROM off-by-one-pagination
TOOL sh_run
TOOL ss
LIMIT turns 20
LIMIT tokens 100000
LIMIT wall_clock 300
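How a harness enforces these limits isn't specified here; wall_clock, at least, maps naturally onto coreutils timeout. A sketch of that one mechanism (an assumption, not needle-bench's actual harness):

```shell
#!/bin/sh
# Hypothetical enforcement of "LIMIT wall_clock": timeout(1) kills the
# process once the budget expires and exits with status 124.
timeout 1 sleep 2   # stand-in for an agent that runs past its budget
code=$?
echo "harness saw exit code $code"   # 124 signals budget exhausted
```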

Scoring: 11 metrics, no opinions

Every run produces the same 11 numbers. No subjective evaluation. No LLM-as-judge. The test passes or it doesn't. Everything else is measured, not guessed.

Leaderboard ranking: resolved (desc) → turns_to_fix (asc) → token_cost (asc) → wall_clock (asc).
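That ordering is just a multi-key sort. A sketch with coreutils sort over a hypothetical whitespace-separated export (columns: agent, resolved, turns_to_fix, token_cost, wall_clock):

```shell
#!/bin/sh
# Hypothetical flat export of four runs.
printf '%s %s %s %s %s\n' \
  agent-a 1 4  32000  45.2 \
  agent-b 1 4  28000  60.1 \
  agent-c 0 20 100000 300.0 \
  agent-d 1 3  50000  44.0 > leaderboard.txt

# resolved desc (-k2,2nr), then turns_to_fix, token_cost, wall_clock asc.
sort -k2,2nr -k3,3n -k4,4n -k5,5g leaderboard.txt
```

agent-d wins on turns despite a higher token cost; token_cost only breaks the tie between agent-a and agent-b.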

Current benchmarks

Name                     Category         Difficulty
off-by-one-pagination    logic error      easy
race-condition-counter   concurrency      medium
silent-data-corruption   data integrity   medium

22 more benchmarks planned across easy, medium, and hard tiers. See roadmap.

Get started

Clone, verify, run. Docker required.

git clone https://github.com/os-tack/find-the-needle
cd find-the-needle
make verify   # confirms all benchmarks have valid solutions
make bench    # runs all benchmarks (Docker required)

Run a single benchmark:

make run BENCH=off-by-one-pagination

Validate your own benchmark before submitting:

make validate BENCH=my-new-benchmark
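Whatever `make validate` does internally, the invariant it must establish is two-sided: test.sh fails on the broken image and passes once solution.patch is applied. A self-contained sketch of that invariant with the Docker runs stubbed out (the function names are hypothetical):

```shell
#!/bin/sh
# run_test broken|patched stands in for running ./test.sh inside the
# unpatched or patched Docker image; stubbed here so the logic runs anywhere.
run_test() {
    case "$1" in
        broken)  return 1 ;;  # bug present: test must fail
        patched) return 0 ;;  # solution applied: test must pass
    esac
}

validate() {
    if run_test broken; then
        echo "INVALID: test passes before the fix"; return 1
    fi
    if ! run_test patched; then
        echo "INVALID: solution.patch does not fix the test"; return 1
    fi
    echo "ok"
}

validate
```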

What the data looks like

Every run appends a record to scores.json. No interpretation layer. Raw numbers.

{
  "benchmark": "off-by-one-pagination",
  "agent": "claude-opus-4-6",
  "resolved": true,
  "turns_to_discovery": 1,
  "turns_to_fix": 4,
  "signal_to_noise": 0.85,
  "false_positives": 0,
  "token_cost": 32000,
  "tokens_per_correct_line": 32000,
  "recovery_events": 0,
  "recovery_rate": 1.0,
  "wall_clock": 45.2,
  "blind_discovery": true
}
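Flat records make quick aggregates cheap. A rough sketch over a hypothetical two-record scores.json (grep-counting is fragile; real tooling would use a JSON parser such as jq):

```shell
#!/bin/sh
# Hypothetical scores.json: an array of flat run records, one per line.
cat > scores.json <<'EOF'
[
  {"benchmark": "off-by-one-pagination", "agent": "claude-opus-4-6", "resolved": true},
  {"benchmark": "race-condition-counter", "agent": "claude-opus-4-6", "resolved": false}
]
EOF

resolved=$(grep -c '"resolved": true' scores.json)
total=$(grep -c '"resolved":' scores.json)
echo "resolved $resolved of $total runs"
```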

Contribute a benchmark

Got a gnarly bug? Package it. The best benchmarks come from the worst debugging days.

Contribution guide