Benchmark run
Started Apr 30, 2026, 8:13 AM · Recorded Apr 30, 2026, 9:34 AM · Ended Apr 30, 2026, 9:34 AM
Test suite v1 — Nutrition · 17bc604b897e…
Generated Apr 30, 2026, 9:35 AM · qwen/qwen3-235b-a22b-2507
Model: moonshotai/kimi-k2.6
Tests: 373
Categories: 7 (coding_ui, debugging, hallucination, reasoning, refactoring, security, speed)
No --limit or --fail-fast enforced; fail_fast was false.
results_truncated is true, indicating not all per-test details are included in the payload.

Pass rate: 57/373 (15.28%)
Score: 57.85 out of 375 (15.43%)
Timings: 10.97s (latency), 6.62s, 19.90s (TTFT)
Output speed: 1199.51 tok/s
Cost: $0.309

security: 14/60 (23.33%) with score_percent of 23.33.
hallucination: 11/60 (18.33%) pass rate, with high output speed (5010.4 tok/s).
debugging and refactoring: 9/60 (15%) pass rate, but debugging had better hard-level performance (6/20 vs 5/20).
reasoning: 3/62 (4.84%) and lowest score_percent (6.25%). Only 1 of 20 hard tests passed.
coding_ui: 1/6 passed (16.67%), but 2 tests timed out (including breakout_game and neon_sign_flicker at easy level).
speed: 10/65 (15.38%) pass rate, with speed::medium performing worst (1/21, 4.76%).

coding_ui::breakout_game (hard) and neon_sign_flicker (easy) failed with "The operation timed out.", indicating potential issues with handling visual or complex UI generation tasks.
TTFT (19.90s) is unusually high compared to latency (10.97s), suggesting the model struggles with initial response generation.
coding_ui::easy had the highest average TTFT: 166.91s.
With high output speed (5010.4 tok/s), cost per test remained low due to small token counts.
reasoning::easy had a very low cost per test (e.g., reasoning_easy_09_even_check cost only $0.00001881).
Output speed ranged from 74.02 tok/s in security to 5010.4 tok/s in hallucination, indicating category-specific performance divergence.

The moonshotai/kimi-k2.6 model shows modest performance overall (15.4% score), with notable strengths in security and hallucination detection, but significant weaknesses in reasoning and coding UI tasks. High TTFT and timeouts suggest latency bottlenecks, especially on complex or visual prompts. The model is cost-effective per test but struggles with consistency across difficulty levels.
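The overall pass rate can be cross-checked against the per-category counts quoted in the summary. The following is a minimal sketch, not part of the benchmark tooling. The category name attached to each (passed, total) pair is an assumption inferred from the summary text (in particular, that debugging and refactoring each scored 9/60); the numbers themselves are taken from the summary.

```python
# Per-category (passed, total) counts as quoted in the summary.
# Assumption: the category labels are inferred from the summary wording,
# not read from report.json; debugging and refactoring are both taken as 9/60.
counts = {
    "security": (14, 60),
    "hallucination": (11, 60),
    "debugging": (9, 60),
    "refactoring": (9, 60),
    "reasoning": (3, 62),
    "coding_ui": (1, 6),
    "speed": (10, 65),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100.0 * passed / total, 2)


total_passed = sum(passed for passed, _ in counts.values())
total_tests = sum(total for _, total in counts.values())
overall = pass_rate(total_passed, total_tests)

for name, (passed, total) in counts.items():
    print(f"{name}: {passed}/{total} ({pass_rate(passed, total)}%)")
print(f"overall: {total_passed}/{total_tests} ({overall}%)")
```

Under these assumptions the per-category counts sum to 57 passes over 373 tests, which matches the reported overall pass rate of 15.28%.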
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v1 — Nutrition
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: Not recorded (older report.json)
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
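The Cost column rule described above (per-task USD from cost_usd, or usage.cost when that is what was recorded) could be sketched as below. This is an illustration under assumptions: only the field names cost_usd and usage.cost come from the description, while the helper name row_cost_usd, the row shape, and the sample rows are hypothetical and not taken from report.json.

```python
from typing import Any, Optional


def row_cost_usd(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD cost: prefer cost_usd, fall back to usage.cost, else None.

    Hypothetical helper; the actual report.json row layout may differ.
    """
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return float(cost) if cost is not None else None


# Hypothetical rows, for illustration only (not real benchmark data).
rows = [
    {"name": "task_a", "cost_usd": 0.002},
    {"name": "task_b", "usage": {"cost": 0.001}},
    {"name": "task_c"},  # no cost recorded
]

costs = [row_cost_usd(r) for r in rows]
recorded = [c for c in costs if c is not None]
total = sum(recorded)
```

Rows without either field yield None rather than 0, so an unrecorded cost is not mistaken for a free task when totals are computed.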