Benchmark run
Started May 10, 2026, 9:25 PM · Recorded May 10, 2026, 10:23 PM · Ended May 10, 2026, 10:23 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 10:23 PM · qwen/qwen3-235b-a22b-2507
xiaomi/mimo-v2.5-pro
459 tasks · 9 categories (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
--limit or --category restriction), but results are truncated in output.
fail_fast=false)
248/459 (54.03%) · 2291.76 / 3137 (73.06%) · $0.80 · 7.42s · 4.75s · 0.81s · 77.91 tok/s
96.17% (55/60 passed) · easy and hard levels, with 100% pass rate on coding::easy.
87.33% (29/30 passed) · 100% on cost::medium and cost::easy.
89.44% (51/60 passed) · 96.67% on speed::easy.
73.76% (26/60 passed) · 60% (easy) → 40% (medium) → 30% (hard)
76.39% (30/60 passed) · hard (65% pass) vs easy (35% pass)
71.53% (18/60 passed) · medium (15% pass) and hard (35% pass), despite 40% on easy
67.97% (26/60 passed) · 35% (easy) → 50% (medium) → 45% (hard)
55.82% (5/60 passed) · ≤15% pass rate)
75.06% (8/9 passed) · ui::easy, but perfect on medium and hard
refactoring ($0.16), despite lowest pass rate.
refactoring (250,156 completion tokens), suggesting verbose but incorrect outputs.
coding (0.74s avg), indicating fast initial response for coding tasks.
hallucination::easy (e.g., halluc-edge-pagination, halluc-api-array-flat).
debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 failed with runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function", indicating fundamental misunderstanding.
cost-complex-retry-with-backoff failed with high cost ($0.0031) and long latency (11.44s), suggesting inefficient or incorrect retry logic generation.
xiaomi/mimo-v2.5-pro performs strongly in coding, cost, and speed tasks, with fast response times and high accuracy. It struggles significantly with refactoring, reasoning under constraints, and debugging complex concurrency issues. Hallucination resistance is moderate but inconsistent, with notable failures on edge cases and API behavior. The model is cost-efficient overall ($0.80 for 459 tests), but generates high token volume in low-scoring areas like refactoring.
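The aggregate figures above can be cross-checked against the per-category pass counts. The following is a minimal sketch, not part of the benchmark tooling: the variable names are illustrative, and the only inputs are the numbers printed in this report (per-category passed/total pairs in the order listed, the score total, and the overall spend).

```python
# Per-category (passed, total) pairs copied from the report, in listed order.
category_results = [
    (55, 60), (29, 30), (51, 60), (26, 60), (30, 60),
    (18, 60), (26, 60), (5, 60), (8, 9),
]

passed = sum(p for p, _ in category_results)
total = sum(t for _, t in category_results)
pass_rate = round(100 * passed / total, 2)

# Score points earned vs. available, as printed in the report.
score_pct = round(100 * 2291.76 / 3137, 2)

# Mean spend per task, from the reported $0.80 over all tasks.
cost_per_task = 0.80 / total

print(f"{passed}/{total} ({pass_rate:.2f}%)")   # 248/459 (54.03%)
print(f"score {score_pct:.2f}%")                # score 73.06%
print(f"${cost_per_task:.4f} per task")         # $0.0017 per task
```

The sums reproduce the reported 248/459 (54.03%) and 73.06%, which suggests the per-category lines and the headline numbers come from the same set of rows.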
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
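One plausible reading of "normalized TTFT (inverted)" is a min-max scaling flipped so that a lower time to first token maps to a higher value, which puts it on the same "higher is better" footing as decode tok/s. The sketch below assumes that interpretation; the report does not state the actual formula, and the function name, category keys, and TTFT values are hypothetical placeholders, not data from this run.

```python
def inverted_min_max(values):
    """Scale values to [0, 1] so that the smallest input maps to 1.0.

    Assumption: plain min-max normalization, flipped. If all inputs are
    equal, every value maps to 1.0 to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(hi - v) / (hi - lo) for v in values]


# Hypothetical per-category mean TTFT in seconds (placeholders only).
ttft_seconds = {"category_a": 0.5, "category_b": 1.0, "category_c": 1.5}

scores = dict(zip(ttft_seconds, inverted_min_max(list(ttft_seconds.values()))))
print(scores)  # {'category_a': 1.0, 'category_b': 0.5, 'category_c': 0.0}
```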
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
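The grouping and cost lookup described above could be sketched as follows. This is an illustration under assumptions, not the dashboard's code: the `results`, `cost_usd`, and `usage.cost` names come from the description, while the `id`, `category`, and `difficulty` keys, the sample rows, and the choice to prefer `cost_usd` over `usage.cost` are hypothetical.

```python
from collections import defaultdict


def task_cost(row):
    """Return per-task USD from cost_usd, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")


def group_rows(results):
    """Group rows by category, then difficulty, keeping input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical stand-in for the parsed contents of report.json.
report = {
    "results": [
        {"id": "t1", "category": "coding", "difficulty": "easy", "cost_usd": 0.002},
        {"id": "t2", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.004}},
        {"id": "t3", "category": "ui", "difficulty": "easy"},  # no cost recorded
    ]
}

grouped = group_rows(report["results"])
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        print(category, difficulty, [(r["id"], task_cost(r)) for r in rows])
```

Appending rows in iteration order keeps each group's rows in the same order as the input list, which matches the stated behaviour that row order within each table follows benchmark execution order.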
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)