Benchmark run
Started Apr 24, 2026, 7:45 PM · Ended Apr 24, 2026, 7:50 PM · Recorded Apr 24, 2026, 11:06 PM
Generated Apr 24, 2026, 7:50 PM · qwen/qwen3-235b-a22b-2507
minimax/minimax-m2.5:free
7 tests in 7 categories: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed (--fail-fast disabled; all tests executed)
7 RPM · $0.00 (all completions free)
Passed: 5/7 (71.4%) · Score: 5.15/7 (73.6%)
Avg latency: 34.8s · Median latency: 11.2s · Avg TTFT: 22.4s · Avg throughput: 767.0 tok/s
The high average latency is skewed by the coding_ui test (137.1s); most other tests complete in under 12s. TTFT varies significantly across categories, from 2.7s in hallucination to 90.4s in coding_ui.
coding_ui:
Passed 0/1 · Score 0.15/1 (15%) · Latency 137.1s (highest of all tests) · TTFT 90.4s · 112.5 tok/s (lowest observed)
debugging, hallucination, reasoning, security, speed:
Passed 1/1 in each · Score 1.0 in all five · TTFT up to 28s, fastest at 2.7s (hallucination) · up to 1.7k tok/s (reasoning, debugging)
refactoring:
Passed 0/1 · Score 0/1 · Latency 12.4s · TTFT 11.5s · 542.0 tok/s
Notes:
coding_ui: the 90.4s TTFT suggests potential inefficiency or blocking behavior when generating UI code
security: only 9.1 tok/s despite a correct, detailed answer, likely due to a deliberate, verbose explanation
Tokens: 763 prompt · 6,976 completion · 7,739 processed
Despite two failures, the model demonstrates strong reasoning, security awareness, and speed in most domains. The coding_ui and refactoring failures suggest weaknesses in code-generation fidelity under specific patterns.
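The headline mean/median gap can be sanity-checked from the per-test figures above. Below is a minimal sketch, assuming a simple list of per-test latencies; the five quick-test latencies are placeholders (only coding_ui and refactoring are reported individually), so the printed figures illustrate the skew rather than reproduce the report's exact values.

    from statistics import mean, median

    # Latencies transcribed from the rows above where reported; the last
    # five entries are placeholders (assumed < 12s) for illustration only.
    latency_s = [137.1, 12.4, 11.2, 9.0, 8.0, 6.5, 5.0]

    print(f"mean latency:   {mean(latency_s):.1f}s")    # pulled upward by the 137.1s outlier
    print(f"median latency: {median(latency_s):.1f}s")  # robust to the outlier

    # Token accounting from the summary: prompt + completion = processed total.
    prompt_tok, completion_tok = 763, 6_976
    assert prompt_tok + completion_tok == 7_739

With one 137.1s outlier among seven runs, the mean lands well above the median, which is why the report shows a 34.8s average against an 11.2s median.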
Per-model aggregates from overall_ranking.json for this run id.
No matching report.json under results/; charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
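The two captions above describe a fallback: per-test detail comes from report.json when the benchmark wrote it, otherwise charts fall back to ranking or summary aggregates. A minimal sketch of that logic, assuming a results/<run_id>/report.json layout; the paths and key names are illustrative, not blxbench's documented schema.

    import json
    from pathlib import Path

    def load_run(results_dir: Path, run_id: str) -> dict:
        """Prefer per-test rows from report.json; fall back to ranking aggregates."""
        report = results_dir / run_id / "report.json"    # assumed layout
        if report.exists():
            return {"source": "report", "data": json.loads(report.read_text())}
        ranking = results_dir / "overall_ranking.json"   # assumed layout
        return {"source": "ranking", "data": json.loads(ranking.read_text())}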
Discovery
Limited: up to 1 test per category
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
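A sketch of how those two charts could be assembled from per-test rows, assuming rows shaped like {"category", "score", "latency_s", "passed"}; the field names and sample values are illustrative, not the actual report.json keys.

    import matplotlib.pyplot as plt

    rows = [  # illustrative rows in the assumed shape
        {"category": "coding_ui",   "score": 0.15, "latency_s": 137.1, "passed": False},
        {"category": "refactoring", "score": 0.00, "latency_s": 12.4,  "passed": False},
        {"category": "reasoning",   "score": 1.00, "latency_s": 8.0,   "passed": True},
        {"category": "speed",       "score": 1.00, "latency_s": 5.0,   "passed": True},
    ]

    # Score % vs mean latency, over every row where a sample exists.
    plt.scatter([r["latency_s"] for r in rows], [100 * r["score"] for r in rows])
    plt.xlabel("mean latency (s)")
    plt.ylabel("score (%)")
    plt.savefig("score_vs_latency.png")
    plt.close()

    # Per-test latency chart, successful timings only, per the caption above.
    ok = [r for r in rows if r["passed"]]
    plt.bar([r["category"] for r in ok], [r["latency_s"] for r in ok])
    plt.ylabel("latency (s)")
    plt.savefig("per_test_latency.png")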
Normalized TTFT (inverted) vs decode tok/s per category for this run.
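"Normalized TTFT (inverted)" presumably min-max scales TTFT and flips it so that faster first tokens land closer to 1; a sketch under that assumption (the function name is hypothetical).

    def inverted_ttft(ttft_by_category: dict[str, float]) -> dict[str, float]:
        """Min-max normalize TTFT, inverted so the lowest (fastest) TTFT maps to 1.0."""
        lo, hi = min(ttft_by_category.values()), max(ttft_by_category.values())
        span = (hi - lo) or 1.0  # guard against a single-category run
        return {c: 1.0 - (t - lo) / span for c, t in ttft_by_category.items()}

    # Using TTFTs reported above: hallucination -> 1.0, coding_ui -> 0.0.
    print(inverted_ttft({"hallucination": 2.7, "refactoring": 11.5, "coding_ui": 90.4}))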
Per-test rows from report.json → results, grouped by category (collapsed by default), then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall score (0–100) from judge_validation / validation_model for coding/UI tests (hover for summary and subscores). No HTML or screenshots appear in this table.
7 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
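A sketch of the grouping rules those captions describe: group by category, then difficulty, keep each group's rows in execution order, and render the Visual column only when some row carries a details.visual score. Both helpers are hypothetical, not part of blxbench.

    from itertools import groupby

    def group_rows(rows: list[dict]) -> dict:
        """Group by category, then difficulty, preserving execution order within groups."""
        indexed = sorted(enumerate(rows), key=lambda p: p[1]["category"])
        out: dict = {}
        for cat, cat_group in groupby(indexed, key=lambda p: p[1]["category"]):
            by_diff = sorted(cat_group, key=lambda p: p[1]["difficulty"])
            for diff, diff_group in groupby(by_diff, key=lambda p: p[1]["difficulty"]):
                # Sorting by the original index restores benchmark execution order.
                out.setdefault(cat, {})[diff] = [r for _, r in sorted(diff_group)]
        return out

    def show_visual_column(rows: list[dict]) -> bool:
        """Mirror the caption's rule: show Visual only if any row has details.visual."""
        return any("visual" in r.get("details", {}) for r in rows)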