Benchmark run
Started Jun 9, 2026, 6:36 PM · Recorded Jun 9, 2026, 8:08 PM · Ended Jun 9, 2026, 8:08 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated Jun 9, 2026, 8:09 PM · qwen/qwen3-235b-a22b-2507
anthropic/claude-fable-54599
… (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
… --limit or --category filter), though results are truncated ("results_truncated": true)
… fail_fast: false)
259/459 (56.4%)
2033.76 / 3155 (64.5%)
11.5s
9.6s
4.03s
187.1 tok/s
$18.57
45,972 prompt + 365,991 completion = 411,963 total
58/60 passed (96.7% pass rate) and 98.5% score. Strong across all difficulty levels, achieving 100% on easy and hard subcategories.
29/30 passed (96.7% pass rate) and 88% score. Perfect on medium and easy levels.
48/60 passed (80% pass rate), 85.6% score. Performance drops slightly on hard tasks (75% pass rate).
8/9 passed (88.9% pass rate), 86.2% score. Only one failure on a hard test.
6/60 passed (10% pass rate), 42.5% score. Performance is consistently poor across all difficulties, worst on hard (5% pass rate).
21/60 passed (35% pass rate), 40.6% score. Struggles across the board, especially on easy tasks (35% pass rate despite lower difficulty).
18/60 passed (30% pass rate), 75.4% score. Despite low pass rate, scores higher due to partial credit on complex constraint-based tasks.
28/60 passed (46.7% pass rate), 58.2% score. Performance degrades with difficulty: 75% (easy), 40% (medium), 25% (hard).
43/60 (71.7%)
79.7%
… 100% on some), but fails on edge cases like halluc-edge-rate-limiter and halluc-bug-label-statement.
… $5.18) despite lowest pass rate, due to long outputs (103,790 completion tokens).
… 20s latency (e.g., coding-hard-diff-objects at 20.39s).
… 19848 tok/s on reason-constraint-rollout-window) due to rapid final bursts.
… 3–5s across categories, with reasoning having the highest average (5.03s).
… debug-concurrent-map-delete-v2, debug-microtask-race-v2 failed).
Per-model aggregates from overall_ranking.json for this run id.
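The headline figures quoted in the summary can be cross-checked with a little arithmetic. The sketch below is only an illustration: the numbers are copied from the summary text, the (passed, total) pairs are listed in the order they appear there without assigning them to named categories, and the helper name `pass_rate` is hypothetical rather than part of blxbench.

```python
# Per-category (passed, total) pairs as quoted in the summary, in order of appearance.
# Which pair belongs to which category is not recoverable from the text, so none is assumed.
category_results = [
    (58, 60), (29, 30), (48, 60), (8, 9), (6, 60),
    (21, 60), (18, 60), (28, 60), (43, 60),
]


def pass_rate(passed, total):
    """Return the pass rate as a percentage rounded to one decimal place (hypothetical helper)."""
    return round(100 * passed / total, 1)


# The per-category counts should add up to the overall 259/459.
total_passed = sum(passed for passed, _ in category_results)
total_tasks = sum(total for _, total in category_results)
overall_rate = pass_rate(total_passed, total_tasks)

# Overall score: points earned over points available, as quoted (2033.76 / 3155).
score_pct = round(100 * 2033.76 / 3155, 1)

# Token total: prompt plus completion tokens, as quoted.
total_tokens = 45_972 + 365_991

print(total_passed, total_tasks, overall_rate, score_pct, total_tokens)
```

Under these assumptions the per-category counts are consistent with the overall pass count, score percentage, and token total reported for the run.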
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.4
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
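The cost rule described above (per-task USD from cost_usd or usage.cost when recorded) could be sketched as follows. This is an assumption-laden illustration, not the dashboard's code: the row layout is guessed from the field names in the text, the preference for cost_usd over usage.cost is assumed, and `task_cost`, `total_cost`, and the sample rows are hypothetical.

```python
def task_cost(row):
    """Return the per-task cost in USD, or None when no cost was recorded.

    Assumption: cost_usd is preferred when present, with usage.cost as the fallback.
    """
    cost = row.get("cost_usd")
    if cost is None:
        # usage may be missing or null in a row, so fall back to an empty dict.
        cost = (row.get("usage") or {}).get("cost")
    return cost


def total_cost(rows):
    """Sum the recorded costs, skipping rows that have none."""
    return sum(cost for cost in map(task_cost, rows) if cost is not None)


# Hypothetical rows for illustration only; real rows come from report.json results.
sample_rows = [
    {"id": "task-a", "cost_usd": 0.5},
    {"id": "task-b", "usage": {"cost": 0.25}},
    {"id": "task-c"},  # no cost recorded
]

print(total_cost(sample_rows))
```

Returning None for rows with no recorded cost keeps "not recorded" distinct from a genuine zero-cost task.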
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
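The grouping described here (by category, then by difficulty, keeping benchmark execution order within each group) might look like the sketch below. The field names `id`, `category`, and `difficulty` are assumptions about the row layout, `group_rows` is a hypothetical helper, and the difficulty values in the sample are made up for illustration.

```python
from collections import defaultdict


def group_rows(rows):
    """Group task ids by category, then difficulty, preserving input order within each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["category"]][row["difficulty"]].append(row["id"])
    # Convert to plain dicts; insertion order is preserved in Python 3.7+.
    return {category: dict(by_difficulty) for category, by_difficulty in grouped.items()}


# Hypothetical rows in execution order; the difficulty labels are illustrative assumptions.
sample_rows = [
    {"id": "debug-concurrent-map-delete-v2", "category": "debugging", "difficulty": "hard"},
    {"id": "coding-hard-diff-objects", "category": "coding", "difficulty": "hard"},
    {"id": "debug-microtask-race-v2", "category": "debugging", "difficulty": "hard"},
]

grouped = group_rows(sample_rows)
print(grouped)
```

Because rows are appended in the order they are read, each table keeps the execution order without a separate sort step.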