BLXBench - Run run

Benchmark run

Started May 10, 2026, 9:25 PM · Recorded May 10, 2026, 10:23 PM · Ended May 10, 2026, 10:23 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 76.6
Tests: 459
Models: 1
Passed: 248
Failed: 211
Pass rate: 54.0%
Duration: 3493.6s
Categories: 9
Speed avg: 85.2 t/s
Speed TTFT: 807 ms
Cost/strict: $0.0006
Strict success: 96.7%
Score/$: 139878.05
Failed spend: $0.0031
P50 task cost: $0.0004
P90 task cost: $0.0011
Est. cost (run): $0.80
Tokens (Σ results): 141.9k / 250.2k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 10:23 PM · qwen/qwen3-235b-a22b-2507

Scope

Single model tested: xiaomi/mimo-v2.5-pro
Total tests run: 459
Categories covered: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --category restriction), but results are truncated in output.
Fail-fast mode: Disabled (fail_fast=false)

Performance Summary

Overall pass rate: 248/459 (54.03%)
Score: 2291.76 / 3137 (73.06%)
Total cost: $0.80
Latency:
- Average: 7.42s
- Median: 4.75s
- Average TTFT (Time to First Token): 0.81s
Output speed: 77.91 tok/s
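
The headline percentages can be re-derived from the raw counts. The following is a minimal sketch that uses only figures quoted in this summary; the variable names are illustrative and are not part of blxbench, and the mean cost per test is only approximate because the $0.80 total is a rounded estimate.

```python
# Figures copied from the run summary; nothing here is read from report.json.
PASSED, TOTAL = 248, 459
SCORE, MAX_SCORE = 2291.76, 3137
TOTAL_COST_USD = 0.80  # rounded estimate, so derived costs are approximate


def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


pass_rate = percent(PASSED, TOTAL)          # share of tests that passed
score_pct = percent(SCORE, MAX_SCORE)       # share of achievable score points
failed = TOTAL - PASSED                     # tests that did not pass
mean_cost_per_test = TOTAL_COST_USD / TOTAL # simple average, not P50

print(f"pass rate: {pass_rate}%")
print(f"score: {score_pct}%")
print(f"failed: {failed}")
print(f"mean cost per test: ${mean_cost_per_test:.4f}")
```

The pass rate and the score percentage differ because a failed test can still earn partial score points, which is why a 54.03% pass rate coexists with a 73.06% score.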

Category-Level Performance

Strong Performers (>85% score)

Coding: 96.17% (55/60 passed)
- Excels in easy and hard levels, with 100% pass rate on coding::easy.
Cost: 87.33% (29/30 passed)
- High consistency across difficulty levels, including 100% on cost::medium and cost::easy.
Speed: 89.44% (51/60 passed)
- Strong across all levels, with 96.67% on speed::easy.

Moderate Performers (70–80% score)

Debugging: 73.76% (26/60 passed)
- Pass rate declines with difficulty: 60% (easy) → 40% (medium) → 30% (hard)
Hallucination: 76.39% (30/60 passed)
- Passes more often on hard (65%) than on easy (35%)
Reasoning: 71.53% (18/60 passed)
- Low pass rates at every level: 40% (easy), 15% (medium), 35% (hard)
UI: 75.06% (8/9 passed)
- One failure on ui::easy; perfect on medium and hard

Weak Performers (<70% score)

Security: 67.97% (26/60 passed)
- Weakest on easy rather than hard: 35% (easy) → 50% (medium) → 45% (hard)
Refactoring: 55.82% (5/60 passed)
- Extremely poor performance across all levels (≤15% pass rate)
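
The per-category counts above can be cross-checked against the run totals. This sketch uses only the pass counts and percentages listed in this section. The helper `passes_from_level_rates` and its default of 20 tests per difficulty level are assumptions made for illustration; the report does not state how tests are split across levels.

```python
# (passed, total) per category, copied from the category list above.
CATEGORY_RESULTS = {
    "coding": (55, 60),
    "cost": (29, 30),
    "speed": (51, 60),
    "debugging": (26, 60),
    "hallucination": (30, 60),
    "reasoning": (18, 60),
    "security": (26, 60),
    "refactoring": (5, 60),
    "ui": (8, 9),
}

total_passed = sum(p for p, _ in CATEGORY_RESULTS.values())
total_tests = sum(t for _, t in CATEGORY_RESULTS.values())

# Pass rate per category in percent, sorted from worst to best.
pass_rates = sorted(
    ((name, round(100 * p / t, 1)) for name, (p, t) in CATEGORY_RESULTS.items()),
    key=lambda item: item[1],
)


def passes_from_level_rates(rates_pct, tests_per_level=20):
    """Rebuild a pass count from per-level pass rates.

    ASSUMPTION: every level has the same number of tests (20 by default).
    """
    return sum(round(rate * tests_per_level / 100) for rate in rates_pct)


print(total_passed, total_tests)
for name, rate in pass_rates:
    print(f"{name}: {rate}% passed")
print(passes_from_level_rates([60, 40, 30]))  # debugging: easy, medium, hard
```

Under the equal-split assumption, the per-level rates quoted for debugging, reasoning, and security reproduce their reported pass counts. Sorting by pass rate also shows that pass rate and score percentage rank categories differently, so the two should not be read interchangeably.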

Notable Observations

Costliest category: refactoring ($0.16), despite lowest pass rate.
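Highest token output: refactoring. The summary cites 250,156 completion tokens, but that figure matches the 250.2k run-wide value in the header's token totals, so it likely covers all categories rather than refactoring alone. Verbose but incorrect outputs remain a plausible reading of the category's high cost.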
Lowest TTFT: coding (0.74s avg), indicating fast initial response for coding tasks.
High hallucination on easy tasks: Despite a 76.39% overall hallucination score, the model passes only 35% of hallucination::easy tests (failures include halluc-edge-pagination and halluc-api-array-flat).
Critical failures:
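- debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 failed with the runtime error "Spread syntax requires ...iterable[Symbol.iterator] to be a function". This error is raised when spread syntax is applied to a non-iterable value, so the generated code did not run, regardless of whether its pollution check was otherwise sound.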
- cost-complex-retry-with-backoff failed with high cost ($0.0031) and long latency (11.44s), suggesting inefficient or incorrect retry logic generation.

Conclusion

xiaomi/mimo-v2.5-pro performs strongly in coding, cost, and speed tasks, with fast response times and high scores. It struggles significantly with refactoring (5/60 passed), reasoning (18/60), and debugging (26/60). Hallucination resistance is moderate but inconsistent, with notable failures on edge cases and API behavior. The model is cost-efficient overall ($0.80 for 459 tests), but generates high token volume in low-scoring areas like refactoring.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
xiaomi/mimo-v2.5-pro | 459 | 248/459 | 73.1% | 7.42s | $0.80

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
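
The grouping and cost fallback described above can be sketched as follows. Only `results`, `cost_usd`, and `usage.cost` are named on this page; the `id`, `category`, and `difficulty` keys, the sample rows, and both function names are hypothetical and may not match the real report.json schema.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def task_cost(row: Row) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


def group_rows(results: List[Row]) -> Dict[str, Dict[str, List[Row]]]:
    """Group rows by category, then difficulty, keeping execution order."""
    grouped: Dict[str, Dict[str, List[Row]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for row in results:
        # "category" and "difficulty" are assumed key names.
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical rows standing in for report.json -> results.
sample_results = [
    {"id": "task-a", "category": "coding", "difficulty": "easy", "cost_usd": 0.0004},
    {"id": "task-b", "category": "coding", "difficulty": "easy", "usage": {"cost": 0.0011}},
    {"id": "task-c", "category": "ui", "difficulty": "hard"},
]

grouped = group_rows(sample_results)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        ids = [r["id"] for r in rows]
        costs = [task_cost(r) for r in rows]
        print(category, difficulty, ids, costs)
```

Appending rows in a single pass keeps each table in benchmark execution order, and returning None for a missing cost lets a table show an empty cell instead of a misleading zero.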