BLXBench - Run run

Benchmark run

Started Jun 1, 2026, 1:04 AM · Recorded Jun 1, 2026, 3:15 AM · Ended Jun 1, 2026, 3:15 AM

Test suite v2 — Resilience · 045d4510abd0…

Blended score 77.3 · Tests 459 · Models 1

Passed 266

Failed 193

Pass rate 58.0%

Duration 7873.2s

Categories 9

Models 1

Speed avg 51.4 t/s

Speed TTFT 1271 ms

Cost/strict $0.0003

Strict success 90.0%

Score/$ 387636.34

Failed spend $0.0006

P50 task cost $0.0002

P90 task cost $0.0004

Est. cost (run) $0.37

Tokens (Σ results) 101.8k / 299.4k

Submitted by Bitslix

Run summary

Generated Jun 1, 2026, 3:15 AM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: minimax/minimax-m3
Total tests: 459
Categories covered: All available categories were included (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no limiting flags like --limit or --category used)
Fail-fast mode: Disabled (fail_fast=false), so all tests were attempted even after failures
Results completeness: The full results are truncated — only a subset of test outcomes is included in this payload.

Performance Summary

Overall Results

Pass rate: 266/459 (57.95%)
Score: 2357.02 out of 3155 (74.71%)
Total cost: $0.3736
Latency:
- Average: 16.80s
- Median: 11.75s
- Average Time to First Token (TTFT): 1.32s
Output speed: 46.05 tokens/sec
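
The overall percentages above follow directly from the raw counts. A minimal Python sketch of that arithmetic, using only the figures reported in this summary (the helper name `pct` is illustrative and not part of blxbench):

```python
def pct(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage, rounded to 2 decimals."""
    return round(numerator / denominator * 100, 2)


pass_rate = pct(266, 459)       # passed tests / total tests
score_pct = pct(2357.02, 3155)  # points earned / points available

print(f"Pass rate: {pass_rate}%")
print(f"Score: {score_pct}%")
```

The two values differ because the score awards partial credit per test, while the pass rate counts only tests that passed outright.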

Category-Level Performance

✅ Strong Performers

Cost:
- Pass rate: 27/30 (90%)
- Score: 132/150 (88%)
- Very low cost: $0.0068
- Fast output: 58.10 tok/s
Coding:
- Pass rate: 51/60 (85%)
- Score: 241/261 (92.34%)
- High efficiency: 74.65 tok/s
- Cost: $0.0268
Speed:
- Pass rate: 49/60 (81.67%)
- Score: 156/180 (86.67%)
- Fast generation: 51.44 tok/s
Hallucination:
- Pass rate: 41/60 (68.33%)
- Score: 291/360 (80.83%)
- Strong resistance to hallucination, especially on medium difficulty (95% pass)

⚠️ Moderate Performers

Security:
- Pass rate: 34/60 (56.67%)
- Score: 386/512 (75.39%)
- Slower output: 30.47 tok/s
Debugging:
- Pass rate: 36/60 (60%)
- Score: 420/601 (69.88%)
- Performance degrades on hard tests (50% pass rate)

❌ Weak Performers

Reasoning:
- Pass rate: 17/60 (28.33%)
- Score: 388/541 (71.72%)
- Very low pass rate across all levels (only 15% on easy)
Refactoring:
- Pass rate: 10/60 (16.67%)
- Score: 338/541 (62.48%)
- Consistently poor across all difficulty levels (15% max pass rate)
UI:
- Pass rate: 1/9 (11.11%)
- Only 1 test passed (ui::easy), all medium/hard failed
- Cost: $0.0 of the total cost is attributed to this category
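
As a cross-check, the per-category pass counts listed above can be summed and compared with the overall 266/459. The sketch below only re-adds numbers copied from this summary; the dictionary layout and variable names are illustrative and do not reflect the structure of report.json.

```python
# (passed, total) per category, copied from the category breakdown above.
categories = {
    "cost": (27, 30),
    "coding": (51, 60),
    "speed": (49, 60),
    "hallucination": (41, 60),
    "security": (34, 60),
    "debugging": (36, 60),
    "reasoning": (17, 60),
    "refactoring": (10, 60),
    "ui": (1, 9),
}

total_passed = sum(passed for passed, _ in categories.values())
total_tests = sum(total for _, total in categories.values())

# Rank categories from weakest to strongest pass rate.
ranked = sorted(categories.items(), key=lambda item: item[1][0] / item[1][1])

for name, (passed, total) in ranked:
    print(f"{name:<14}{passed:>3}/{total:<3} {passed / total:7.2%}")
print(f"total: {total_passed}/{total_tests}")
```

The category counts add up to the reported overall totals, so the breakdown is internally consistent on pass counts.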

Difficulty-Level Insights

Easy tests: Well handled in coding (100% pass) and cost (80% pass), but not in reasoning (15% pass) or refactoring (at most 15% pass)
Medium tests: Pass rates are low in debugging, reasoning, and refactoring (45%, 45%, and 15% respectively)
Hard tests: Significant struggles in reasoning (25% pass), refactoring (15%), and ui (0%)

Notable Failures and Observations

coding-hard-topological-sort:
- Failed with null TTFT and latency of 120.16s — likely timed out
reasoning category:
- Repeated failures on constraint reasoning tasks, even at easy level (e.g., reason-constraint-disk-quota, reason-rc-login-timeout)
refactoring:
- Extremely low pass rate (10/60) with no level exceeding 15% success
ui category:
- Only ui::easy passed; all 6 medium and 2 hard tests failed
Latency outliers:
- Some debugging and reasoning tests exceeded 20s, with max at 34.75s (reason-constraint-consistency-latency)
Cost concentration:
- refactoring was most expensive category: $0.0780 (20.9% of total cost), driven by high token usage (64k completion tokens)
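
The cost-concentration figure can be reproduced from the two dollar amounts in this summary. A small sketch, assuming the reported totals are exact to four decimals (variable names are illustrative):

```python
refactoring_cost = 0.0780  # USD, refactoring category cost reported above
total_cost = 0.3736        # USD, total cost of the run
refactoring_tests = 60     # tests in the refactoring category
total_tests = 459          # tests in the whole run

share = refactoring_cost / total_cost
mean_cost_refactoring = refactoring_cost / refactoring_tests
mean_cost_overall = total_cost / total_tests

print(f"Refactoring share of total cost: {share:.1%}")
print(f"Mean cost per refactoring test: ${mean_cost_refactoring:.4f}")
print(f"Mean cost per test overall: ${mean_cost_overall:.4f}")
```

Comparing the two means shows that a refactoring test costs more on average than a typical test in this run, which is consistent with the high completion-token usage noted above.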

Summary

The minimax/minimax-m3 model performs strongly in coding, cost optimization, and speed tasks, showing fast response times and high accuracy. It also resists hallucination well, especially on medium-difficulty prompts.

However, it struggles significantly with reasoning, refactoring, and UI tasks, with pass rates below 17% in the latter two. The model’s high failure rate on logical and architectural reasoning, even at easy levels, suggests limitations in deep program comprehension.

Despite a moderate overall score (74.7%), the high cost and latency in low-pass-rate categories indicate that the model is inefficient on complex or nuanced software engineering tasks.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model               Tests  Pass     Score  Latency  Cost
minimax/minimax-m3  459    266/459  74.7%  16.80s   $0.37

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)