Benchmark run
Started Apr 24, 2026, 7:45 PM · Ended Apr 24, 2026, 7:50 PM · Recorded Apr 24, 2026, 11:06 PM
Generated Apr 24, 2026, 7:50 PM · qwen/qwen3-235b-a22b-2507
minimax/minimax-m2.5:free
7 tests in 7 categories: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed (--fail-fast disabled; all tests executed)
7 RPM · $0.00 (all completions free)
Passed: 5/7 (71.4%) · Score: 5.15/7 (73.6%)
Avg latency: 34.8s · Median latency: 11.2s · Avg TTFT: 22.4s · Avg throughput: 767.0 tok/s
The high average latency is skewed by the coding_ui test (137.1s); most other tests complete in under 12s. TTFT varies significantly across categories, from 2.7s in hallucination to 90.4s in coding_ui.
coding_ui:
Passed 0/1 · Score 0.15/1 (15%) · Latency 137.1s (highest of all tests) · TTFT 90.4s · 112.5 tok/s (lowest observed)
debugging, hallucination, reasoning, security, speed:
Passed 1/1 in each · Score 1.0 in all five · TTFT up to 28s, fastest at 2.7s (hallucination) · up to 1.7k tok/s (reasoning, debugging)
refactoring:
Passed 0/1 · Score 0/1 · Latency 12.4s · TTFT 11.5s · 542.0 tok/s
Notes:
coding_ui: the 90.4s TTFT suggests potential inefficiency or blocking behavior when generating UI code
security: only 9.1 tok/s despite a correct, detailed answer, likely due to a deliberate, verbose explanation
Tokens: 763 prompt · 6,976 completion · 7,739 processed
Despite two failures, the model demonstrates strong reasoning, security awareness, and speed in most domains. The coding_ui and refactoring failures suggest weaknesses in code-generation fidelity under specific patterns.
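The headline mean/median gap can be sanity-checked from the per-test figures above. Below is a minimal sketch, assuming a simple list of per-test latencies; the five quick-test latencies are placeholders (only coding_ui and refactoring are reported individually), so the printed figures illustrate the skew rather than reproduce the report's exact values.

    from statistics import mean, median

    # Latencies transcribed from the rows above where reported; the last
    # five entries are placeholders (assumed < 12s) for illustration only.
    latency_s = [137.1, 12.4, 11.2, 9.0, 8.0, 6.5, 5.0]

    print(f"mean latency:   {mean(latency_s):.1f}s")    # pulled upward by the 137.1s outlier
    print(f"median latency: {median(latency_s):.1f}s")  # robust to the outlier

    # Token accounting from the summary: prompt + completion = processed total.
    prompt_tok, completion_tok = 763, 6_976
    assert prompt_tok + completion_tok == 7_739

With one 137.1s outlier among seven runs, the mean lands well above the median, which is why the report shows a 34.8s average against an 11.2s median.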
Per-model aggregates from overall_ranking.json for this run id.
No matching report.json under results/; charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
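The two captions above describe a fallback: per-test detail comes from report.json when the benchmark wrote it, otherwise charts fall back to ranking or summary aggregates. A minimal sketch of that logic, assuming a results/<run_id>/report.json layout; the paths and key names are illustrative, not blxbench's documented schema.

    import json
    from pathlib import Path

    def load_run(results_dir: Path, run_id: str) -> dict:
        """Prefer per-test rows from report.json; fall back to ranking aggregates."""
        report = results_dir / run_id / "report.json"    # assumed layout
        if report.exists():
            return {"source": "report", "data": json.loads(report.read_text())}
        ranking = results_dir / "overall_ranking.json"   # assumed layout
        return {"source": "ranking", "data": json.loads(ranking.read_text())}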
Discovery
Limited: up to 1 test per category
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
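A sketch of how those two charts could be assembled from per-test rows, assuming rows shaped like {"category", "score", "latency_s", "passed"}; the field names and sample values are illustrative, not the actual report.json keys.

    import matplotlib.pyplot as plt

    rows = [  # illustrative rows in the assumed shape
        {"category": "coding_ui",   "score": 0.15, "latency_s": 137.1, "passed": False},
        {"category": "refactoring", "score": 0.00, "latency_s": 12.4,  "passed": False},
        {"category": "reasoning",   "score": 1.00, "latency_s": 8.0,   "passed": True},
        {"category": "speed",       "score": 1.00, "latency_s": 5.0,   "passed": True},
    ]

    # Score % vs mean latency, over every row where a sample exists.
    plt.scatter([r["latency_s"] for r in rows], [100 * r["score"] for r in rows])
    plt.xlabel("mean latency (s)")
    plt.ylabel("score (%)")
    plt.savefig("score_vs_latency.png")
    plt.close()

    # Per-test latency chart, successful timings only, per the caption above.
    ok = [r for r in rows if r["passed"]]
    plt.bar([r["category"] for r in ok], [r["latency_s"] for r in ok])
    plt.ylabel("latency (s)")
    plt.savefig("per_test_latency.png")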
Normalized TTFT (inverted) vs decode tok/s per category for this run.
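"Normalized TTFT (inverted)" presumably min-max scales TTFT and flips it so that faster first tokens land closer to 1; a sketch under that assumption (the function name is hypothetical).

    def inverted_ttft(ttft_by_category: dict[str, float]) -> dict[str, float]:
        """Min-max normalize TTFT, inverted so the lowest (fastest) TTFT maps to 1.0."""
        lo, hi = min(ttft_by_category.values()), max(ttft_by_category.values())
        span = (hi - lo) or 1.0  # guard against a single-category run
        return {c: 1.0 - (t - lo) / span for c, t in ttft_by_category.items()}

    # Using TTFTs reported above: hallucination -> 1.0, coding_ui -> 0.0.
    print(inverted_ttft({"hallucination": 2.7, "refactoring": 11.5, "coding_ui": 90.4}))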
Per-test rows from report.json → results, grouped by category (collapsed by default), then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall score (0–100) from judge_validation / validation_model for coding/UI tests (hover for summary and subscores). No HTML or screenshots appear in this table.
7 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
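A sketch of the grouping rules those captions describe: group by category, then difficulty, keep each group's rows in execution order, and render the Visual column only when some row carries a details.visual score. Both helpers are hypothetical, not part of blxbench.

    from itertools import groupby

    def group_rows(rows: list[dict]) -> dict:
        """Group by category, then difficulty, preserving execution order within groups."""
        indexed = sorted(enumerate(rows), key=lambda p: p[1]["category"])
        out: dict = {}
        for cat, cat_group in groupby(indexed, key=lambda p: p[1]["category"]):
            by_diff = sorted(cat_group, key=lambda p: p[1]["difficulty"])
            for diff, diff_group in groupby(by_diff, key=lambda p: p[1]["difficulty"]):
                # Sorting by the original index restores benchmark execution order.
                out.setdefault(cat, {})[diff] = [r for _, r in sorted(diff_group)]
        return out

    def show_visual_column(rows: list[dict]) -> bool:
        """Mirror the caption's rule: show Visual only if any row has details.visual."""
        return any("visual" in r.get("details", {}) for r in rows)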