SWE-bench is a museum — a curated collection of historical bugs. needle-bench is a marketplace. Developers package their worst production failures as self-contained Docker images. An agent is dropped in with no context beyond "find the needle." Scoring is deterministic: same run, same scores. The benchmarks are real, the bugs are anonymized, and the leaderboard is public.
## How it works
Each benchmark is a directory with four files:
```
benchmarks/off-by-one-pagination/
  Dockerfile         # builds the broken environment
  Agentfile          # agent constraints (tools, turn limit, token budget)
  test.sh            # exit 0 = fixed, exit 1 = still broken
  .bench/
    solution.patch   # sealed ground truth — agent never sees this
```

The Dockerfile contains a real bug from a real codebase. Not synthetic. Not contrived. The agent gets shell access and a failing test. That's it.
A real Agentfile looks like this:
```
FROM off-by-one-pagination
TOOL sh_run
TOOL ss
LIMIT turns 20
LIMIT tokens 100000
LIMIT wall_clock 300
```

## Scoring: 11 metrics, no opinions
Every run produces the same 11 numbers. No subjective evaluation. No LLM-as-judge. The test passes or it doesn't. Everything else is measured, not guessed.
- `resolved`: did test.sh pass? boolean. the only metric that matters for the headline.
- `turns_to_discovery`: turns before the agent identifies the correct file and region.
- `turns_to_fix`: turns before the agent produces a passing patch.
- `signal_to_noise`: ratio of productive actions to total actions. 0.0–1.0.
- `false_positives`: files edited that aren't in the solution. fewer is better.
- `token_cost`: total tokens consumed (input + output). the bill.
- `tokens_per_correct_line`: efficiency — how many tokens per line of correct fix.
- `recovery_events`: times the agent went wrong, noticed, and self-corrected.
- `recovery_rate`: of those recovery events, how many led back to the fix.
- `wall_clock`: wall-clock seconds from start to score.
- `blind_discovery`: did the agent find the bug with no hints beyond test output?
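Two of the ratio metrics can be reproduced by hand from a run's raw counts. A sketch (the action-log format and the numbers here are invented for illustration; the real harness records these during the run):

```shell
# Toy action log: one line per agent action, tagged productive or noise.
# This format is an assumption, not the harness's real log format.
cat > actions.log <<'EOF'
productive
noise
productive
productive
EOF

# signal_to_noise = productive actions / total actions
awk '/^productive/ {p++} {t++} END {printf "signal_to_noise %.2f\n", p/t}' actions.log
# → signal_to_noise 0.75

# tokens_per_correct_line = token_cost / lines in the correct fix:
# a one-line fix at 32000 tokens costs 32000 tokens per correct line.
echo "tokens_per_correct_line $((32000 / 1))"
# → tokens_per_correct_line 32000
```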
Leaderboard ranking: `resolved` (desc) → `turns_to_fix` (asc) → `token_cost` (asc) → `wall_clock` (asc).
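The ranking rule can be applied with plain `sort`. A sketch over an assumed whitespace-separated export of agent, resolved, turns_to_fix, token_cost, and wall_clock (the real leaderboard presumably works off scores.json directly):

```shell
# Toy export: agent, resolved, turns_to_fix, token_cost, wall_clock.
cat > runs.txt <<'EOF'
agent-a 1 6 50000 120.0
agent-b 1 4 32000 45.2
agent-c 0 9 80000 300.0
EOF

# -k2,2nr sorts resolved descending; the remaining keys break ties ascending.
sort -k2,2nr -k3,3n -k4,4n -k5,5n runs.txt
# agent-b ranks first: resolved, with the fewest turns_to_fix.
```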
## Current benchmarks
| Name | Category | Difficulty |
|---|---|---|
| off-by-one-pagination | logic error | easy |
| race-condition-counter | concurrency | medium |
| silent-data-corruption | data integrity | medium |
22 more benchmarks planned across easy, medium, and hard tiers. See roadmap.
## Get started
Clone, verify, run. Docker required.
```sh
git clone https://github.com/os-tack/find-the-needle
cd needle-bench
make verify   # confirms all benchmarks have valid solutions
make bench    # runs all benchmarks (Docker required)
```

Run a single benchmark:

```sh
make run BENCH=off-by-one-pagination
```

Validate your own benchmark before submitting:

```sh
make validate BENCH=my-new-benchmark
```

## What the data looks like
Every run appends a record to scores.json. No interpretation layer. Raw numbers.
```json
{
  "benchmark": "off-by-one-pagination",
  "agent": "claude-opus-4-6",
  "resolved": true,
  "turns_to_discovery": 1,
  "turns_to_fix": 4,
  "signal_to_noise": 0.85,
  "false_positives": 0,
  "token_cost": 32000,
  "tokens_per_correct_line": 32000,
  "recovery_events": 0,
  "recovery_rate": 1.0,
  "wall_clock": 45.2,
  "blind_discovery": true
}
```

## Contribute a benchmark
Got a gnarly bug? Package it. The best benchmarks come from the worst debugging days.
- A real bug from a real codebase — not synthetic, not contrived
- A Dockerfile that builds in under 5 minutes
- A `test.sh` that fails on the bug and passes on the fix
- A sealed `.bench/solution.patch` the agent never sees
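The fail-before, pass-after contract behind that checklist can be sketched end to end with toy stand-ins (every file name here is invented, and the one-line rewrite stands in for applying `.bench/solution.patch`):

```shell
# Toy validation: test.sh must fail on the broken tree
# and pass once the fix is applied. All paths are invented.
mkdir -p bench
printf 'broken\n' > bench/code.txt
printf '%s\n' '#!/bin/sh' 'grep -q fixed bench/code.txt' > bench/test.sh

sh bench/test.sh && echo "BAD: passed before the fix" \
                 || echo "ok: fails before the fix"

printf 'fixed\n' > bench/code.txt   # stand-in for applying solution.patch

sh bench/test.sh && echo "ok: passes after the fix" \
                 || echo "BAD: still failing after the fix"
```

A submission that passes before the patch, or fails after it, has no needle to find and should be rejected by validation.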