Benchmark run
Started May 10, 2026, 9:25 PM · Recorded May 10, 2026, 10:11 PM · Ended May 10, 2026, 10:11 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 10:11 PM · qwen/qwen3-235b-a22b-2507
xiaomi/mimo-v2.5
459
[…] category, level, or limit)
[…] fail_fast = false)
[…] results_truncated = true), meaning only a subset of test outcomes are shown in results_compact
254/459 (55.3%) · 2314.67 out of 3137 (73.8%)
5.57s · 4.13s · 1.24s · 120.75 tok/s · $0.45
140,034 prompt + 214,469 completion = 354,503 total
54/60 passed (90% pass rate), 94.4% score; strongest category
30/30 passed (100% pass rate), 90.7% score; perfect pass rate
52/60 passed (86.7%), 93.9% score; excels in correctness and efficiency
9/9 passed (100%), 85.2% score; fully passes all tests
13/60 passed (21.7%), 66.0% score; weakest pass rate
14/60 passed (23.3%), 56.3% score; struggles with secure coding patterns
18/60 passed (30.0%), 73.2% score; poor on constraint-based logic
31/60 passed (51.7%), 74.7% score; moderate hallucination rate
[…] 2.37s) due to one very slow test (coding::easy-max-by-key at 2.35s TTFT).
[…] 1.34s) and longest median latencies.
[…] $0.092, due to high completion token usage (45,844 tokens).
[…] $0.093, despite lower pass rate.
[…] reason-constraint-consistency-latency, reason-rc-null-pointer-call).
[…] easy (4/20 passed).
[…] 49.5%) despite being easiest level; indicates fundamental gaps.
debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 failed with runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function"
[…] halluc-api-array-flat, halluc-api-atomics-wait).
[…] halluc-doc-* and halluc-edge-* tests.
[…] $0.001 each due to long outputs.
xiaomi/mimo-v2.5 performs well in coding, cost optimization, and speed, but struggles significantly with reasoning, refactoring, and security tasks. It shows a tendency to hallucinate APIs and misunderstand edge cases in distributed systems and constraint logic. While cost-efficient overall, it incurs higher costs in UI and reasoning due to verbose outputs. The model is reliable for straightforward coding tasks but less trustworthy for complex system reasoning or secure code generation.
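The headline percentages in the summary can be cross-checked against the raw counts it quotes. The following is a minimal sketch of that arithmetic, assuming the percentages are plain ratios rounded to one decimal place; the helper name pct and the variable names are illustrative and are not taken from blxbench.

```python
def pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Raw counts as quoted in the run summary.
passed, total_tests = 254, 459
score, max_score = 2314.67, 3137
prompt_tokens, completion_tokens = 140_034, 214_469

pass_rate = pct(passed, total_tests)    # share of tests that passed
score_rate = pct(score, max_score)      # share of available score points earned
total_tokens = prompt_tokens + completion_tokens

print(f"pass rate {pass_rate}% | score {score_rate}% | {total_tokens:,} tokens")
```

The same helper applies to the per-category rows, for example pct(52, 60) for the 52/60 row. The gap between the pass rate and the score percentage suggests that failing tests still earn partial score, although the summary does not state how the score is weighted.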
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.2
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
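The table rules described above (group by category, then by difficulty, keep execution order, read cost from cost_usd or usage.cost, omit the Visual column when no row has details.visual) can be sketched roughly as follows. This is an illustration only, not the dashboard's code: the row keys "id", "category" and "difficulty" and every sample value are assumptions, and only cost_usd, usage.cost and details.visual are field names mentioned in the text.

```python
from collections import defaultdict
from typing import Any, Optional

# Hypothetical rows standing in for report.json results; the keys "id",
# "category" and "difficulty" and all values are illustrative assumptions.
results = [
    {"id": "task-a", "category": "coding", "difficulty": "easy", "cost_usd": 0.002},
    {"id": "task-b", "category": "security", "difficulty": "hard", "usage": {"cost": 0.004}},
    {"id": "task-c", "category": "coding", "difficulty": "easy"},
]


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


def group_rows(rows: list[dict[str, Any]]) -> dict[str, dict[str, list[dict[str, Any]]]]:
    """Group by category, then difficulty; lists keep the input (execution) order."""
    grouped: dict[str, dict[str, list[dict[str, Any]]]] = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


def has_visual_column(rows: list[dict[str, Any]]) -> bool:
    """The Visual column is shown only if some row carries a details.visual score."""
    return any((row.get("details") or {}).get("visual") is not None for row in rows)


grouped = group_rows(results)
for category, levels in grouped.items():
    for difficulty, rows in levels.items():
        costs = [c for c in map(task_cost, rows) if c is not None]
        print(category, difficulty, [r["id"] for r in rows], sum(costs))
print("visual column:", has_visual_column(results))
```

Because Python dictionaries preserve insertion order, categories and difficulty levels appear in the order they are first seen, and rows within each group stay in benchmark execution order without an explicit sort.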