Benchmark run
Started Jun 1, 2026, 1:04 AM · Recorded Jun 1, 2026, 3:15 AM · Ended Jun 1, 2026, 3:15 AM
Test suite v2 — Resilience · 045d4510abd0…
Generated Jun 1, 2026, 3:15 AM · qwen/qwen3-235b-a22b-2507
Model
minimax/minimax-m3
Tests
459
Categories
coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Discovery
Full suite (no --limit or --category used)
Fail-fast
Disabled (fail_fast=false), so all tests were attempted even after failures
Tests passed
266/459 (57.95%)
Score
2357.02 out of 3155 (74.71%)
Total cost
$0.3736
Timings
16.80s · 11.75s · 1.32s
Throughput
46.05 tokens/sec

Results by category (tests passed · score · notes; category labels survive only for refactoring and ui)
27/30 (90%) · 132/150 (88%) · $0.0068 · 58.10 tok/s
51/60 (85%) · 241/261 (92.34%) · 74.65 tok/s · $0.0268
49/60 (81.67%) · 156/180 (86.67%) · 51.44 tok/s
41/60 (68.33%) · 291/360 (80.83%) · (… 95% pass)
34/60 (56.67%) · 386/512 (75.39%) · 30.47 tok/s
36/60 (60%) · 420/601 (69.88%) · (… 50% pass rate)
17/60 (28.33%) · 388/541 (71.72%) · (… 15% on easy)
refactoring: 10/60 (16.67%) · 338/541 (62.48%) · (… 15% max pass rate)
ui: 1/9 (11.11%) · 1 test passed (ui::easy), all medium/hard failed · $0.0 of total cost attributed here despite few tests

Pass-rate highlights
… coding (100% pass) and cost (80% pass)
… debugging, reasoning, and refactoring (45%, 45%, 15% pass respectively)
… reasoning (25% pass), refactoring (15%), and ui (0%)

Notable results
coding-hard-topological-sort: null TTFT and latency of 120.16s; likely timed out
reasoning category: (… reason-constraint-disk-quota, reason-rc-login-timeout)
refactoring: (… 10/60) with no level exceeding 15% success
ui category: ui::easy passed; all 6 medium and 2 hard tests failed
… debugging and reasoning tests exceeded 20s, with max at 34.75s (reason-constraint-consistency-latency)
refactoring was the most expensive category: $0.0780 (20.9% of total cost), driven by high token usage (64k completion tokens)

The minimax/minimax-m3 model performs strongly in coding, cost optimization, and speed tasks, showing fast response times and high accuracy. It also resists hallucination well, especially on medium-difficulty prompts.
However, it struggles significantly with reasoning, refactoring, and UI tasks, with pass rates below 17% in the latter two (16.67% and 11.11%). The model’s high failure rate on logical and architectural reasoning, even at the easy level, suggests limitations in deep program comprehension.
Despite a moderate overall score (74.71%), the high cost and latency in low-pass-rate categories indicate inefficiency when handling complex or nuanced software engineering tasks.
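The headline percentages in this summary can be re-derived from the raw counts it reports. The sketch below is only a sanity check on that arithmetic; the helper name `pct` is hypothetical and is not part of blxbench.

```python
def pct(part: float, whole: float, digits: int = 2) -> float:
    """Return part/whole as a percentage, rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# Figures copied from the run summary above.
pass_rate = pct(266, 459)                    # tests passed / tests attempted
score_rate = pct(2357.02, 3155)              # points earned / points available
refactoring_pass = pct(10, 60)               # refactoring tests passed
ui_pass = pct(1, 9)                          # ui tests passed
refactoring_cost_share = pct(0.0780, 0.3736, digits=1)  # share of total spend

print(f"pass rate: {pass_rate}%")
print(f"score: {score_rate}%")
print(f"refactoring pass rate: {refactoring_pass}%")
print(f"ui pass rate: {ui_pass}%")
print(f"refactoring cost share: {refactoring_cost_share}%")
```

Note the gap between the strict pass rate and the score percentage: a test can earn partial points without counting as passed, which is why the score sits well above the pass rate.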
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.4
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
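A minimal sketch of the grouping described above, assuming each row in the report.json results list is a JSON object. The keys `category` and `difficulty` are assumptions about the schema, and the sample rows are made-up illustrative values, not data from this run. Only `cost_usd` and `usage.cost` are named in the table description.

```python
from typing import Any, Optional


def row_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None  # cost was not recorded for this row


def group_rows(rows: list[dict[str, Any]]) -> dict[str, dict[str, list[dict[str, Any]]]]:
    """Group rows by category, then difficulty, keeping execution order."""
    grouped: dict[str, dict[str, list[dict[str, Any]]]] = {}
    for row in rows:
        # "category" and "difficulty" are assumed key names.
        by_difficulty = grouped.setdefault(row["category"], {})
        by_difficulty.setdefault(row["difficulty"], []).append(row)
    return grouped


# Hypothetical rows for illustration only.
sample_rows = [
    {"id": "example-a", "category": "coding", "difficulty": "easy", "cost_usd": 0.001},
    {"id": "example-b", "category": "ui", "difficulty": "easy", "usage": {"cost": 0.002}},
    {"id": "example-c", "category": "coding", "difficulty": "hard"},
    {"id": "example-d", "category": "coding", "difficulty": "easy", "cost_usd": 0.003},
]

grouped = group_rows(sample_rows)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        known = [c for c in (row_cost(r) for r in rows) if c is not None]
        print(category, difficulty, [r["id"] for r in rows], round(sum(known), 4))
```

Plain dictionaries preserve insertion order in Python 3.7+, so rows inside each group stay in benchmark execution order without any extra sorting.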