BLXBench - Run run

Benchmark run

Started Apr 30, 2026, 8:13 AM · Recorded Apr 30, 2026, 9:34 AM · Ended Apr 30, 2026, 9:34 AM

Test suite v1 — Nutrition · 17bc604b897e…

Blended score: 15.5 · Tests: 373 · Models: 1

Passed: 57

Failed: 316

Pass rate: 15.3%

Duration: 4836.4s

Categories: 7

Models: 1

Speed avg: 348.7 t/s

Speed TTFT: 28935ms

Est. cost (run): $0.31

Tokens (Σ results): 18.3k / 73.5k

Submitted by: Bitslix

Run summary

Generated Apr 30, 2026, 9:35 AM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: moonshotai/kimi-k2.6
Total tests: 373
Categories: 7 (coding_ui, debugging, hallucination, reasoning, refactoring, security, speed)
Run mode: Full benchmark (no --limit or --fail-fast enforced; fail_fast was false)
Results completeness: results_truncated is true, indicating not all per-test details are included in the payload.

Performance

Overall pass rate: 57/373 (15.28%)
Total score: 57.85 out of 375 (15.43%)
Average latency: 10.97s
Median latency: 6.62s
Average TTFT (Time to First Token): 19.90s
Average output speed: 1199.51 tok/s
Total cost: $0.309
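
The headline percentages follow directly from the counts above. A minimal sketch of the arithmetic, using only figures reported in this summary (the variable names are illustrative, and the maximum score of 375 is taken as reported even though it differs from the test count of 373):

```python
# Figures copied from the Performance section of this run summary.
passed, total_tests = 57, 373
score, max_score = 57.85, 375   # max_score is as reported, not derived
total_cost_usd = 0.309

pass_rate = 100 * passed / total_tests
score_percent = 100 * score / max_score
mean_cost_per_test = total_cost_usd / total_tests

print(f"pass rate:          {pass_rate:.2f}%")       # 15.28%
print(f"score:              {score_percent:.2f}%")   # 15.43%
print(f"mean cost per test: ${mean_cost_per_test:.5f}")
```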

Category-level Patterns

Stronger Categories

Security: Highest pass rate at 14/60 (23.33%) with score_percent of 23.33.
Hallucination: Second-best at 11/60 (18.33%) pass rate, with high output speed (5010.4 tok/s).
Debugging & Refactoring: Both at 9/60 (15%) pass rate, but debugging had better hard-level performance (6/20 vs 5/20).

Weaker Categories

Reasoning: Lowest pass rate at 3/62 (4.84%) and lowest score_percent (6.25%). Only 1 of 20 hard tests passed.
Coding UI: 1/6 passed (16.67%), and 2 of the 6 tests timed out: breakout_game (hard) and neon_sign_flicker (easy).
Speed: 10/65 (15.38%) pass rate, with speed::medium performing worst (1/21, 4.76%).
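
The per-category figures can be cross-checked against the run totals. The sketch below re-enters the pass counts quoted in this section (the dictionary layout is an assumption for illustration, not the benchmark's own data structure), ranks categories by pass rate, and confirms that they sum to 57/373:

```python
# (passed, total) per category, as quoted in the category-level summary.
categories = {
    "security": (14, 60),
    "hallucination": (11, 60),
    "debugging": (9, 60),
    "refactoring": (9, 60),
    "reasoning": (3, 62),
    "coding_ui": (1, 6),
    "speed": (10, 65),
}

def pass_rate(passed, total):
    """Pass rate as a percentage."""
    return 100 * passed / total

# Rank categories from highest to lowest pass rate.
ranked = sorted(categories.items(), key=lambda kv: pass_rate(*kv[1]), reverse=True)
for name, (p, t) in ranked:
    print(f"{name:<14}{p:>3}/{t:<3} {pass_rate(p, t):6.2f}%")

# The category counts should add up to the run totals.
total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())
assert (total_passed, total_tests) == (57, 373)
```

On pass rate alone, coding_ui ranks third, but with only 6 tests its percentage is far more sensitive to a single result than the categories with 60 or more tests.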

Notable Failures and Observations

Timeouts:
- coding_ui::breakout_game (hard) and neon_sign_flicker (easy) failed with "The operation timed out.", indicating potential issues with handling visual or complex UI generation tasks.
High TTFT:
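- Average TTFT (19.90s) exceeds average latency (10.97s). For a single request, time to first token should not exceed total latency, so the two averages are presumably computed over different sets of tests or skewed by outliers. Either way, slow initial response generation stands out.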
- coding_ui::easy had the highest average TTFT: 166.91s.
Cost Efficiency:
- Cost per test remained low due to small token counts, including in hallucination, the category with the highest output speed (5010.4 tok/s).
- reasoning::easy had a very low cost per test (e.g., reasoning_easy_09_even_check cost only $0.00001881).
Output Speed Variance:
- Ranged from 74.02 tok/s in security to 5010.4 tok/s in hallucination, indicating category-specific performance divergence.
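
Two of the observations above reduce to simple comparisons. A small sketch, using only the numbers quoted in this summary (variable names are illustrative):

```python
# Values quoted in the Performance and Notable Failures sections.
avg_ttft_s = 19.90
avg_latency_s = 10.97
slowest_tok_s = 74.02    # security
fastest_tok_s = 5010.4   # hallucination

# Flag the anomaly: the TTFT average is larger than the latency average.
ttft_exceeds_latency = avg_ttft_s > avg_latency_s

# Spread between the fastest and slowest category output speeds.
speed_ratio = fastest_tok_s / slowest_tok_s

print(f"TTFT average exceeds latency average: {ttft_exceeds_latency}")
print(f"fastest / slowest output speed: {speed_ratio:.1f}x")
```

The ratio works out to roughly 68x between hallucination and security.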

Summary

The moonshotai/kimi-k2.6 model shows modest performance overall (15.4% score), with notable strengths in security and hallucination detection, but significant weaknesses in reasoning and coding UI tasks. High TTFT and timeouts suggest latency bottlenecks, especially on complex or visual prompts. The model is cost-effective per test but struggles with consistency across difficulty levels.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model · Tests · Pass · Score · Latency · Cost
moonshotai/kimi-k2.6 · 373 · 57/373 · 15.4% · 10.97s · $0.31

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v1 — Nutrition

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

Not recorded (older report.json)

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
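
The grouping described above (by category, then by difficulty, keeping execution order, with cost read from cost_usd or usage.cost) could be sketched as follows. This is a hedged illustration only: the row fields `id`, `category`, `difficulty`, and `passed` are assumed names, the actual report.json schema is not shown on this page, and the sample rows are placeholders rather than results from this run.

```python
from collections import defaultdict

def row_cost(row):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")

def group_results(results):
    """Group rows by category, then difficulty, preserving input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped

# Placeholder rows for illustration, NOT taken from this run.
sample_results = [
    {"id": "a", "category": "reasoning", "difficulty": "easy", "passed": True, "cost_usd": 0.00002},
    {"id": "b", "category": "reasoning", "difficulty": "hard", "passed": False, "usage": {"cost": 0.0004}},
    {"id": "c", "category": "security", "difficulty": "easy", "passed": True},
    {"id": "d", "category": "reasoning", "difficulty": "easy", "passed": False, "cost_usd": 0.00003},
]

grouped = group_results(sample_results)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        passed = sum(r["passed"] for r in rows)
        # Rows without any recorded cost are skipped in the sum.
        costs = [c for c in map(row_cost, rows) if c is not None]
        print(category, difficulty, f"{passed}/{len(rows)}", f"${sum(costs):.5f}")
```

Because Python dictionaries and lists keep insertion order, rows within each group stay in the order they appear in the input, which mirrors the execution-order guarantee stated above.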