BLXBench - Run run

Benchmark run

Started May 10, 2026, 4:22 PM · Recorded May 10, 2026, 6:26 PM · Ended May 10, 2026, 6:26 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 51.9
Tests: 459
Models: 1
Passed: 148
Failed: 311
Pass rate: 32.2%
Duration: 7459.2s
Categories: 9
Speed avg: 449.8 t/s
Speed TTFT: 1543 ms
Cost/strict: $0.0011
Strict success: 83.3%
Score/$: 87956.24
Failed spend: $0.0051
P50 task cost: $0.0009
P90 task cost: $0.0012
Est. cost (run): $1.04
Tokens (Σ results): 43.8k / 613.4k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 6:26 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: minimax/minimax-m2.7
Total tests run: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Run mode: Full benchmark (no --limit or --fail-fast used)
Results completeness: results_truncated: true — output was truncated, but aggregate data is complete.

Performance Summary

Overall Results

Pass rate: 148/459 (32.24%)
Score: 1601.07 out of 3137 (51.04%)
Total cost: $1.04
Total tokens: 43765 prompt + 613398 completion = 657163 tokens
Latency: Average 15.91s, median 9.40s
Time to first token (TTFT): Average 1.29s
Output speed: Average 211.77 tokens/second
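
The headline percentages follow directly from the raw counts listed above. A minimal sketch that recomputes them, using only the figures quoted in this summary (the variable names are illustrative, not fields from report.json):

```python
# Raw figures copied from the Overall Results list.
passed, total_tests = 148, 459
score, max_score = 1601.07, 3137
prompt_tokens, completion_tokens = 43765, 613398

# Pass rate and score as percentages, rounded as in the report.
pass_rate = round(100 * passed / total_tests, 2)
score_pct = round(100 * score / max_score, 2)

# Token total, plus the share of tokens that were completion tokens.
total_tokens = prompt_tokens + completion_tokens
completion_share = round(100 * completion_tokens / total_tokens, 1)

print(f"pass rate: {pass_rate}%")
print(f"score: {score_pct}%")
print(f"total tokens: {total_tokens}")
print(f"completion share: {completion_share}%")
```

Completion tokens make up over 90% of the total, so output length is the main driver of token usage in this run.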

Category-Level Performance

Pass Rate & Score by Category

cost: 25/30 (83.33% pass rate), score 117/150 (78%) — strongest category
speed: 33/60 (55.00%), score 107/180 (59.44%)
security: 16/60 (26.67%), score 311/512 (60.74%)
hallucination: 21/60 (35.00%), score 202/360 (56.11%)
coding: 26/60 (43.33%), score 120/261 (45.98%)
reasoning: 8/60 (13.33%), score 322/541 (59.52%)
debugging: 14/60 (23.33%), score 265/583 (45.45%)
refactoring: 4/60 (6.67%), score 156/541 (28.84%)
ui: 1/9 (11.11%), score 1.07/9 (11.86%); weakest category by score (refactoring has the lowest pass rate)
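
The category rows can be rolled up to check the overall totals and to see how the ranking depends on the metric chosen. The sketch below copies the numbers from the list above; the tuple layout and names are illustrative only.

```python
# (passed, tests, score, max_score) per category, copied from the list above.
categories = {
    "cost": (25, 30, 117, 150),
    "speed": (33, 60, 107, 180),
    "security": (16, 60, 311, 512),
    "hallucination": (21, 60, 202, 360),
    "coding": (26, 60, 120, 261),
    "reasoning": (8, 60, 322, 541),
    "debugging": (14, 60, 265, 583),
    "refactoring": (4, 60, 156, 541),
    "ui": (1, 9, 1.07, 9),
}

# Roll the per-category rows up to run-level totals.
total_passed = sum(c[0] for c in categories.values())
total_tests = sum(c[1] for c in categories.values())
total_score = round(sum(c[2] for c in categories.values()), 2)
total_max = sum(c[3] for c in categories.values())

# Rank categories two ways: by pass rate and by score percentage.
by_pass = sorted(categories, key=lambda n: categories[n][0] / categories[n][1], reverse=True)
by_score = sorted(categories, key=lambda n: categories[n][2] / categories[n][3], reverse=True)

print(total_passed, total_tests, total_score, total_max)
print("by pass rate:", by_pass)
print("by score %:  ", by_score)
```

The totals reproduce 148/459 and 1601.07/3137. The two rankings agree on the top category (cost) but not on the bottom: refactoring is last by pass rate, ui is last by score percentage.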

Notable Latency & Speed Observations

Fastest output speed:
- security: 458.31 tok/s
- speed: 449.76 tok/s
- refactoring: 361.52 tok/s
Slowest output speed: coding (57.52 tok/s), debugging (60.10 tok/s)
Lowest TTFT: security (0.62s avg), refactoring (0.85s)
Highest TTFT: ui (2.92s avg)

Notable Failures & Patterns

High-Failure Categories

refactoring: Only 4/60 passed — extremely poor performance across all difficulty levels (max pass_rate: 15% on hard).
ui: Only 1/9 passed, with 0% on easy and hard sublevels.
reasoning: Only 8/60 passed, with 5% pass rate on easy tasks — suggests fundamental reasoning gaps.
debugging: 14/60 passed, with no level exceeding 30% pass rate.

Difficulty Trends

coding: Strong on easy (80% pass), weak on medium (35%) and hard (15%)
cost: Consistently high across levels (80–90% pass)
hallucination: Poor on easy (25%) and medium (25%), but 55% on hard — unusual inverse trend
security: Best on medium (40%), worst on easy (15%)
refactoring: 0% on medium, 15% on hard — no clear trend
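
The per-level pass rates can be cross-checked against the category pass counts. The sketch below assumes each 60-test category splits evenly into 20 easy, 20 medium and 20 hard tests; the report does not state this split, so treat it as a hypothesis that the numbers happen to support.

```python
# Assumption (not stated in the report): each 60-test category has
# 20 easy, 20 medium and 20 hard tests.
TESTS_PER_LEVEL = 20

# Per-level pass rates quoted in the Difficulty Trends list.
trends = {
    "coding": {"easy": 0.80, "medium": 0.35, "hard": 0.15},
    "hallucination": {"easy": 0.25, "medium": 0.25, "hard": 0.55},
}

# Category pass counts quoted in the category list.
reported_passes = {"coding": 26, "hallucination": 21}

implied_passes = {}
inverse_trend = {}
for name, rates in trends.items():
    # Convert each rate to a whole number of passed tests.
    per_level = {level: round(rate * TESTS_PER_LEVEL) for level, rate in rates.items()}
    implied_passes[name] = sum(per_level.values())
    # An inverse trend means hard tasks pass more often than easy ones.
    inverse_trend[name] = rates["hard"] > rates["easy"]
    print(name, per_level, "total:", implied_passes[name], "inverse:", inverse_trend[name])
```

Under that assumption the implied totals match the reported 26 and 21 passes, and only hallucination shows the inverse trend noted above.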

Cost & Efficiency

Most expensive category: refactoring ($0.25), due to long outputs (118708 completion tokens)
Cheapest category: cost ($0.0266), despite high pass rate
Highest token efficiency: cost — high pass rate at low cost
Lowest efficiency: refactoring — high cost, very low pass rate
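
One way to put a number on the efficiency gap is cost per passed test. This is a rough sketch using the category spend and pass counts quoted in this summary; "cost per pass" is an illustrative metric here, not necessarily one the benchmark itself reports.

```python
# (category spend in USD, passed tests), copied from this summary.
spend_and_passes = {
    "refactoring": (0.25, 4),
    "cost": (0.0266, 25),
}

# Spend divided by the number of passing tests in each category.
cost_per_pass = {
    name: spend / passed for name, (spend, passed) in spend_and_passes.items()
}

# How many times more a refactoring pass costs than a cost-category pass.
ratio = cost_per_pass["refactoring"] / cost_per_pass["cost"]

for name, value in cost_per_pass.items():
    print(f"{name}: ${value:.4f} per passed test")
print(f"ratio: {ratio:.1f}x")
```

On these figures a passing refactoring test costs roughly 59 times as much as a passing cost test.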

Failures with High Impact

coding-hard tasks mostly failed, including astar-grid, bloom-filter, dijkstra, expression-evaluator, trie-autocomplete
debugging had multiple 0-score failures on critical issues like debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 (both errored with an iterator issue)
hallucination showed mixed results: passed factual API checks (node-crypto, stream-pipeline) but failed basic behavior (event-loop-order, structuredclone)
reasoning failed most constraint-based logic, especially on easy tasks like reason-constraint-api-version-compat and reason-rc-login-timeout

Conclusion

The minimax/minimax-m2.7 model is strongest on cost-awareness tasks (83.33% pass rate) and passes some factual API checks, but struggles severely with refactoring (4/60), reasoning (8/60), debugging (14/60), and UI tasks (1/9). Output speed is high in security and speed (roughly 450 tok/s) but drops to about 60 tok/s in coding and debugging, and time to first token is highest in ui (2.92s average). Low per-task cost does not offset the low pass rates on medium and hard coding tasks and on reasoning tasks, which limits practical utility. The failures on basic runtime behavior in the hallucination category and the low debugging pass rate suggest caution in production use.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                 Tests  Pass     Score  Latency  Cost
minimax/minimax-m2.7  459    148/459  51.0%  15.91s   $1.04

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
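
The cost fallback described above (cost_usd first, then usage.cost) could be sketched as follows. The row layout is an assumption: the report only names the two fields, so the nesting of usage.cost, the helper name task_cost, and the sample rows are all hypothetical.

```python
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task cost in USD, or None when nothing was recorded.

    Assumed (hypothetical) row shape: an optional top-level "cost_usd",
    and an optional nested "usage" object with a "cost" key.
    """
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    cost = usage.get("cost")
    return float(cost) if cost is not None else None


# Illustrative rows only; these are not values from this run.
sample_rows = [
    {"id": "example-a", "cost_usd": 0.0009},
    {"id": "example-b", "usage": {"cost": 0.0012}},
    {"id": "example-c"},
]

costs = [task_cost(row) for row in sample_rows]
print(costs)
```

Returning None rather than 0.0 keeps "no cost recorded" distinct from "free", which matters when averaging per-task costs.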

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)