BLXBench - Run

Benchmark run

Started May 10, 2026, 9:25 PM · Recorded May 10, 2026, 10:11 PM · Ended May 10, 2026, 10:11 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 77.5 · Tests: 459 · Models: 1

Passed: 254
Failed: 205
Pass rate: 55.3%
Duration: 2778.6s
Categories: 9
Speed avg: 126.4 t/s
Speed TTFT: 1528 ms
Cost/strict: $0.0003
Strict success: 100.0%
Score/$: 288561.49
Failed spend: $0.00
P50 task cost: $0.0003
P90 task cost: $0.0005
Est. cost (run): $0.45
Tokens (Σ results): 140.0k / 214.5k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 10:11 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: xiaomi/mimo-v2.5
Total tests: 459
Categories: All categories included (no filtering via category, level, or limit)
Run mode: Full run (fail_fast = false)
Results truncated: Yes (results_truncated = true), meaning only a subset of test outcomes is shown in results_compact

Performance Summary

Pass rate: 254/459 (55.3%)
Score: 2314.67 out of 3137 (73.8%)
Average latency: 5.57s
Median latency: 4.13s
Average TTFT (Time to First Token): 1.24s
Average output speed: 120.75 tok/s
Total cost: $0.45
Tokens processed: 140,034 prompt + 214,469 completion = 354,503 total
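
The percentages and the token total above follow directly from the raw counts. A minimal sketch of that arithmetic, assuming one-decimal rounding (the helper name pct is illustrative and not part of blxbench):

```python
# Figures copied from the Performance Summary above.
passed, total_tests = 254, 459
score, max_score = 2314.67, 3137
prompt_tokens, completion_tokens = 140_034, 214_469


def pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pass_rate = pct(passed, total_tests)        # share of tests that passed
score_pct = pct(score, max_score)           # share of available points earned
total_tokens = prompt_tokens + completion_tokens

print(f"Pass rate: {pass_rate}%")
print(f"Score: {score_pct}%")
print(f"Total tokens: {total_tokens:,}")
```

Run as written, this reproduces the 55.3% pass rate, the 73.8% score, and the 354,503 token total reported above.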

Category-Level Performance

High-Performing Categories

Speed: 54/60 passed (90% pass rate), 94.4% score — strongest category
Cost: 30/30 passed (100% pass rate), 90.7% score — perfect pass rate
Coding: 52/60 passed (86.7%), 93.9% score — excels in correctness and efficiency
UI: 9/9 passed (100%), 85.2% score — fully passes all tests

Low-Performing Categories

Refactoring: 13/60 passed (21.7%), 66.0% score — weakest pass rate
Security: 14/60 passed (23.3%), 56.3% score — struggles with secure coding patterns
Reasoning: 18/60 passed (30.0%), 73.2% score — poor on constraint-based logic
Hallucination: 31/60 passed (51.7%), 74.7% score — moderate hallucination rate
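
The summary lists eight categories, while the run header reports nine. As a rough consistency check, the sketch below recomputes each listed pass rate and subtracts the listed totals from the run totals to see what the unlisted category must account for. It assumes the eight listed categories plus one unlisted category cover all 459 tests; the variable names are illustrative.

```python
# (passed, total) per category, copied from the two lists above.
listed = {
    "Speed": (54, 60),
    "Cost": (30, 30),
    "Coding": (52, 60),
    "UI": (9, 9),
    "Refactoring": (13, 60),
    "Security": (14, 60),
    "Reasoning": (18, 60),
    "Hallucination": (31, 60),
}
RUN_PASSED, RUN_TOTAL = 254, 459  # run-level totals from the summary

# Pass rate per listed category, rounded to one decimal place.
rates = {name: round(100 * p / t, 1) for name, (p, t) in listed.items()}

listed_passed = sum(p for p, _ in listed.values())
listed_total = sum(t for _, t in listed.values())

# Whatever is left belongs to the category the summary does not list
# (assumption: no test is counted in more than one category).
unlisted_passed = RUN_PASSED - listed_passed
unlisted_total = RUN_TOTAL - listed_total

for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {rate:5.1f}%")
print(f"Unlisted: {unlisted_passed}/{unlisted_total} passed")
```

With the figures above, the eight listed categories cover 399 tests and 221 passes, leaving 60 tests and 33 passes for the ninth category. The debug-* test names under Failure Highlights suggest that category is debugging, though the summary does not state this.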

Notable Observations

Latency patterns:
- High average TTFT in coding::easy (2.37s) due to one very slow test (coding::easy-max-by-key at 2.35s TTFT).
- Reasoning has the highest average TTFT (1.34s) and longest median latencies.
Cost per category:
- UI and Reasoning are the two most expensive categories, at $0.092 and $0.093 respectively.
- UI's cost is driven by high completion token usage (45,844 tokens); Reasoning costs about the same despite its lower pass rate.
Failures under complexity:
- Multiple failures in reasoning on constraint and race condition logic (e.g., reason-constraint-consistency-latency, reason-rc-null-pointer-call).
- Refactoring fails most tests across all difficulty levels, especially easy (4/20 passed).
- Security::easy has lowest score (49.5%) despite being easiest level — indicates fundamental gaps.

Failure Highlights

Critical logic errors:
- debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 failed with the runtime error "Spread syntax requires ...iterable[Symbol.iterator] to be a function".
- This suggests the model generated JavaScript that applies spread syntax to a non-iterable value, which parses correctly but throws when executed.
Hallucination cases:
- Falsely claims existence of non-standard APIs (e.g., halluc-api-array-flat, halluc-api-atomics-wait).
- Incorrect behavior descriptions in halluc-doc-* and halluc-edge-* tests.
High-cost failures:
- Several reasoning and debugging failures cost over $0.001 each due to long outputs.

Conclusion

xiaomi/mimo-v2.5 performs well in coding, cost optimization, and speed, but struggles significantly with reasoning, refactoring, and security tasks. It shows a tendency to hallucinate APIs and misunderstand edge cases in distributed systems and constraint logic. While cost-efficient overall, it incurs higher costs in UI and reasoning due to verbose outputs. The model is reliable for straightforward coding tasks but less trustworthy for complex system reasoning or secure code generation.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model             Tests  Pass     Score  Latency  Cost
xiaomi/mimo-v2.5  459    254/459  73.8%  5.57s    $0.45

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)