Benchmark run
Started May 30, 2026, 9:04 PM · Recorded May 30, 2026, 9:53 PM · Ended May 30, 2026, 9:53 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 30, 2026, 9:53 PM · qwen/qwen3-235b-a22b-2507
stepfun/step-3.7-flash
459 tasks
…category or level CLI options)
…fail_fast)
…results_compact)
Pass rate: 175/459 (38.1%)
Score: 1742.55 / 3155 (55.2%)
Latency: 6.14s · 4.28s · TTFT 0.68s
Throughput: 257.26 tok/s (average)
Total cost: $0.736

68.3% pass rate, 75.8% score; best in both metrics.
48.3% pass rate, 62.4% score; solid reasoning on bugs.
48.3% pass rate, 73.0% score; good detection of vulnerabilities.
63.3% pass rate, 66% score; effective at identifying inefficient code.
3.3% pass rate (2/60), 17.7% score; severely struggles with code improvement tasks.
0% pass rate (0/9), 6.1% score; completely fails UI-related reasoning.
23.3% pass rate (14/60), 25.7% score; poor on general programming tasks.
23.3% pass rate (14/60), 67.1% score; low pass rate despite moderate scoring.

…1.24s, likely due to complex analysis.
…$0.088 for 20 tests), reflecting longer, more token-intensive tasks.
…hard difficulty tests (0/20), including algorithmic challenges like astar-grid, dijkstra, and segment-tree.
…1 test in each of easy, medium, and hard levels; consistent poor performance.
…9 tests across all difficulties; no successful outputs.
…67.1% score, pass rate is only 23.3%, suggesting partial credit was common but full correctness rare.
…cost category, the model often generates correct code but fails short fixes (e.g., cost-short-fix-sum-of-n, cost-short-fix-factorial-base-case failed despite simple logic).
…debug-db-transaction-isolation-v2 and debug-timezone-dst-bucket-v2, but failed basic ones like debug-missing-await-v2 and debug-env-flag-string-v2.

The stepfun/step-3.7-flash model shows strongest capability in detecting hallucinations and security flaws, with decent performance in debugging and cost-awareness. However, it struggles severely with coding, refactoring, and UI tasks, and fails to reliably perform simple code fixes. While it generates responses quickly (0.68s TTFT), its low pass rate in core programming categories limits practical utility. Total cost of $0.74 is moderate for this scale of testing.
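The percentages quoted in this summary can be re-derived from the raw counts. The following is a minimal Python sketch, assuming each percentage is simply the ratio rounded to one decimal place; that rounding rule is an assumption, and `pct` is a hypothetical helper, not part of blxbench.

```python
def pct(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage, rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)

# Counts copied from the run summary.
overall_pass = pct(175, 459)         # passed tasks / total tasks
overall_score = pct(1742.55, 3155)   # points earned / points available
two_of_sixty = pct(2, 60)            # category with 2/60 passes
fourteen_of_sixty = pct(14, 60)      # categories with 14/60 passes

print(overall_pass, overall_score, two_of_sixty, fourteen_of_sixty)
# prints: 38.1 55.2 3.3 23.3
```

The results match the reported 38.1%, 55.2%, 3.3% and 23.3%, which is consistent with plain ratio-and-round reporting.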
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.4
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
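The grouping described here (by category, then by difficulty, with execution order kept inside each table) could be reproduced roughly as follows. This is only a sketch: the field names `id`, `category` and `difficulty` are assumptions about the row layout, not the confirmed report.json schema, and the sample rows are made up for illustration.

```python
from collections import defaultdict

def group_rows(results):
    """Group row ids by category, then difficulty, keeping input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        # Appending in iteration order preserves benchmark execution order.
        grouped[row["category"]][row["difficulty"]].append(row["id"])
    # Convert to plain dicts for easy printing or serialization.
    return {cat: dict(levels) for cat, levels in grouped.items()}

# Made-up sample rows; not taken from this run.
sample = [
    {"id": "t1", "category": "debug", "difficulty": "easy"},
    {"id": "t2", "category": "cost", "difficulty": "hard"},
    {"id": "t3", "category": "debug", "difficulty": "easy"},
    {"id": "t4", "category": "debug", "difficulty": "hard"},
]

print(group_rows(sample))
# prints: {'debug': {'easy': ['t1', 't3'], 'hard': ['t4']}, 'cost': {'hard': ['t2']}}
```

Because Python dictionaries keep insertion order, categories and difficulty levels appear in the order they are first seen, and rows inside each group stay in their original sequence.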