Benchmark run
Started May 10, 2026, 7:53 PM · Recorded May 10, 2026, 8:09 PM · Ended May 10, 2026, 8:09 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 8:09 PM · qwen/qwen3-235b-a22b-2507
google/gemini-3.1-flash-lite
459
9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
--limit or --fail-fast)
"results_truncated": true), but summary aggregates are complete.
201/459 (43.79%)
2015.13 / 3137 (64.24%)
$0.2297
1.72s
1.33s
0.81s
504.70 tok/s
55/60 passed (91.67%), score 175/180 (97.22%)
29/30 passed (96.67%), score 132/150 (88.00%)
50/60 passed (83.33%), score 246/261 (94.25%)
0/60 passed (0.00%), score 237/541 (43.81%)
5/60 passed (8.33%), score 270/541 (49.91%)
4/9 passed (44.44%), score 3.13/9 (34.77%)
coding::easy achieving 95% pass rate and 98.88% score.
easy and medium levels (100% and 90% pass rates respectively).
100% pass rate on hard level (speed::hard), and 95–100% across all levels.
0% pass rate) across all difficulty levels, despite moderate scoring due to partial credit.
2/20 passed on easy and 1/20 on medium. The hardest reasoning tasks (reasoning::hard) see only 2 passes.
40% pass rate overall, with debugging::hard dropping to 25% pass rate.
21/60 (35%)
halluc-api-array-flat, halluc-api-generator-return)
halluc-edge-float-precision and halluc-bug-typeof-null, passing those.
cost, debugging, hallucination, security, all averaging ~0.65s TTFT.
reasoning (1.44s TTFT), likely due to complex multi-step analysis.
cost (1374.58 tok/s), likely due to concise, factual responses.
debugging (300.69 tok/s), possibly due to verbose diagnostic reasoning.
reasoning ($0.0763), due to long outputs and high token counts (50,918 completion tokens).
cost ($0.0051), reflecting efficient, short responses.
150,132 (vs 27,309 prompt tokens), indicating verbose model outputs.
debugging category: Two tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function", suggesting code generation issues.
reasoning category: Nearly all failures involve misapplying constraints (e.g., time windows, policy merges, rate limits), indicating weak logical consistency.
refactoring category: Universal failure suggests model lacks understanding of code structure transformation or intent preservation.
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
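The percentages quoted in the summary can be cross-checked from the raw counts. The sketch below is illustrative only: the helper name `percent` is hypothetical, and it assumes that 201/459 is the count of passed tests and that 2015.13 / 3137 is the earned score over the maximum score, which the extracted text does not label explicitly.

```python
def percent(numerator: float, denominator: float) -> float:
    """Return numerator/denominator as a percentage rounded to two decimals."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(100.0 * numerator / denominator, 2)


# Counts copied from the summary; their labels are assumptions (see above).
overall_pass_pct = percent(201, 459)
overall_score_pct = percent(2015.13, 3137)

# Ratio of completion tokens to prompt tokens quoted in the summary.
completion_to_prompt = round(150_132 / 27_309, 2)

print(overall_pass_pct, overall_score_pct, completion_to_prompt)
```

Recomputing from counts is a cheap way to catch rounding differences between the summary text and the underlying aggregates.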
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.2
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
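The grouping and cost rules described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the row keys `category` and `difficulty`, the sample ids, and both function names are assumptions, while `cost_usd` and `usage.cost` are the field names mentioned in the table description.

```python
from typing import Any, Optional


def row_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


def group_rows(results: list[dict[str, Any]]) -> dict[str, dict[str, list[dict[str, Any]]]]:
    """Group rows by category, then difficulty, keeping execution order.

    Python dicts preserve insertion order, so rows inside each bucket stay
    in the order they appear in the input list.
    """
    grouped: dict[str, dict[str, list[dict[str, Any]]]] = {}
    for row in results:
        by_difficulty = grouped.setdefault(row["category"], {})
        by_difficulty.setdefault(row["difficulty"], []).append(row)
    return grouped


# Hypothetical rows standing in for the "results" array of report.json.
sample_results = [
    {"id": "task-a", "category": "coding", "difficulty": "easy", "cost_usd": 0.001},
    {"id": "task-b", "category": "cost", "difficulty": "easy", "usage": {"cost": 0.002}},
    {"id": "task-c", "category": "coding", "difficulty": "hard"},
    {"id": "task-d", "category": "coding", "difficulty": "easy", "usage": {"cost": 0.003}},
]

grouped = group_rows(sample_results)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        print(category, difficulty, [(r["id"], row_cost(r)) for r in rows])
```

Returning `None` when neither cost field is recorded keeps "no cost recorded" distinct from a genuine cost of zero.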