Benchmark run
Started May 10, 2026, 4:22 PM · Recorded May 10, 2026, 6:26 PM · Ended May 10, 2026, 6:26 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 6:26 PM · qwen/qwen3-235b-a22b-2507
minimax/minimax-m2.7
459 tasks in 9 categories: coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
(… --limit or --fail-fast used)
results_truncated: true (output was truncated, but aggregate data is complete)

148/459 passed (32.24%)
score 1601.07 out of 3137 (51.04%)
$1.04
43765 prompt + 613398 completion = 657163 tokens
15.91s, median 9.40s
1.29s
211.77 tokens/second

25/30 (83.33% pass rate), score 117/150 (78%): strongest category
33/60 (55.00%), score 107/180 (59.44%)
16/60 (26.67%), score 311/512 (60.74%)
21/60 (35.00%), score 202/360 (56.11%)
26/60 (43.33%), score 120/261 (45.98%)
8/60 (13.33%), score 322/541 (59.52%)
14/60 (23.33%), score 265/583 (45.45%)
4/60 (6.67%), score 156/541 (28.84%)
1/9 (11.11%), score 1.07/9 (11.86%): weakest category

refactoring: 361.52 tok/s
security: 458.31 tok/s
speed: 449.76 tok/s
coding (57.52 tok/s), debugging (60.10 tok/s)
security (0.62s avg), refactoring (0.85s)
ui (2.92s avg)

4/60 passed: extremely poor performance across all difficulty levels (max pass_rate: 15% on hard).
1/9 passed, with 0% on easy and hard sublevels.
8/60 passed, with 5% pass rate on easy tasks, which suggests fundamental reasoning gaps.
14/60 passed, with no level exceeding 30% pass rate.
… 80% pass), weak on medium (35%) and hard (15%)
… 80–90% pass)
… 25%) and medium (25%), but 55% on hard: unusual inverse trend
… 40%), worst on easy (15%)
0% on medium, 15% on hard: no clear trend

refactoring ($0.25), due to long outputs (118708 completion tokens)
cost ($0.0266), despite high pass rate
cost: high pass rate at low cost
refactoring: high cost, very low pass rate

astar-grid, bloom-filter, dijkstra, expression-evaluator, trie-autocomplete
0-score failures on critical issues like debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 (both errored with iterator issue)
… node-crypto, stream-pipeline) but failed basic behavior (event-loop-order, structuredclone)
reason-constraint-api-version-compat and reason-rc-login-timeout

The minimax/minimax-m2.7 model demonstrates strong cost-awareness and factual API knowledge, but struggles severely with code generation (especially refactoring), reasoning, and UI tasks. It is fast in high-throughput categories like security and speed, but slow to start generating in ui and coding. While cost-efficient in some areas, its poor pass rate in complex coding and reasoning tasks limits practical utility. Significant hallucination and debugging blind spots suggest caution in production use.
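The aggregate figures in the summary can be cross-checked against the nine per-category rows. The sketch below is only an illustration of that arithmetic: the tuple layout and the `aggregate` helper are assumptions made for this example, not part of the benchmark tool, and the rows are left unnamed because the summary does not attach category names to them.

```python
# Per-category rows in the order the summary lists them:
# (passed, tests, score, max_score). Category names are omitted because
# the summary does not label these rows.
ROWS = [
    (25, 30, 117.0, 150),
    (33, 60, 107.0, 180),
    (16, 60, 311.0, 512),
    (21, 60, 202.0, 360),
    (26, 60, 120.0, 261),
    (8, 60, 322.0, 541),
    (14, 60, 265.0, 583),
    (4, 60, 156.0, 541),
    (1, 9, 1.07, 9),
]

# Token counts as reported in the summary.
PROMPT_TOKENS = 43765
COMPLETION_TOKENS = 613398


def aggregate(rows):
    """Sum the per-category rows and derive the two overall percentages."""
    passed = sum(r[0] for r in rows)
    tests = sum(r[1] for r in rows)
    score = round(sum(r[2] for r in rows), 2)
    max_score = sum(r[3] for r in rows)
    return {
        "passed": passed,
        "tests": tests,
        "pass_rate_pct": round(100 * passed / tests, 2),
        "score": score,
        "max_score": max_score,
        "score_pct": round(100 * score / max_score, 2),
    }


totals = aggregate(ROWS)
total_tokens = PROMPT_TOKENS + COMPLETION_TOKENS
print(totals, total_tokens)
```

Under these assumptions the sums reproduce the reported totals (148 of 459 passed, 1601.07 of 3137 points, 657163 tokens), so the per-category rows and the headline figures are mutually consistent.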
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.2
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
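The grouping described above (by category, then by difficulty, keeping execution order inside each group) can be sketched as follows. This is a rough illustration rather than the tool's actual code: the row keys `category`, `difficulty`, and `name` are assumed for the example, the sample rows are hypothetical, and only `cost_usd` and `usage.cost` come from the description of the table.

```python
from collections import defaultdict


def row_cost(row):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(results):
    """Group rows by category, then difficulty, preserving input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        # 'category' and 'difficulty' are assumed key names.
        grouped[row["category"]][row["difficulty"]].append(row)
    return {cat: dict(levels) for cat, levels in grouped.items()}


# Hypothetical rows standing in for report.json results.
sample = [
    {"name": "task-a", "category": "coding", "difficulty": "easy", "cost_usd": 0.002},
    {"name": "task-b", "category": "ui", "difficulty": "hard", "usage": {"cost": 0.01}},
    {"name": "task-c", "category": "coding", "difficulty": "easy"},
]

tables = group_rows(sample)
for category, levels in tables.items():
    for difficulty, rows in levels.items():
        names = [r["name"] for r in rows]
        costs = [row_cost(r) for r in rows]
        print(category, difficulty, names, costs)
```

Because Python dictionaries keep insertion order, categories and difficulty levels appear in the order they are first encountered, and rows within each group stay in execution order.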