Benchmark run
Started May 1, 2026, 8:27 AM · Recorded May 1, 2026, 9:24 AM · Ended May 1, 2026, 9:24 AM
Test suite v1 — Nutrition · 17bc604b897e…
Generated May 1, 2026, 9:25 AM · qwen/qwen3-235b-a22b-2507
Model
x-ai/grok-4.3
Tests
373
Categories
coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Filters
None (no --limit, --category, or --level filters applied)
Fail fast
No (fail_fast: false)
Results
Per-test results are truncated (results_truncated: true), but the aggregate summary is complete.
Pass rate
321/373 (86.06%)
Score
320.53 out of 375 (85.48%)
Timing (labels not preserved)
9.09s, 5.21s, 8.20s
Output speed
9941.53 tok/s
Cost
$0.69

Category results (category names are shown only where the entry itself identifies them)
Pass rate 60/60 (100%) and full score (60/60). Extremely fast output speed (23970.27 tok/s) and low TTFT (2.44s).
Pass rate 60/62 (96.77%) and strong score (60/64, 93.75%). Fast output (25771.19 tok/s) and low TTFT (3.32s).
Pass rate 57/60 (95%) and score 57/60 (95%). Moderate output speed (407.22 tok/s), TTFT 4.65s.
coding_ui: pass rate 6/6 (100%) and high score (5.535/6, 92.25%). High cost ($0.15) and long TTFT (88.44s) due to complex UI generation tasks.
speed: 53/65 (81.54%). Includes speed::medium with a very low pass rate (10/21, 47.62%), which drags down category performance.
44/60 (73.33%). Performance varies by difficulty: hard (16/20) > easy (15/20) > medium (13/20).
41/60 (68.33%). Consistently lower across difficulty levels, with easy (12/20) being weakest.

Highlights
TTFT: reasoning::easy (3.19s avg TTFT) and hallucination::easy (2.09s avg TTFT).
Output speed: hallucination::hard (40630.25 tok/s), indicating rapid factual denial responses.
Cost: coding_ui ($0.15) due to high token output in UI generation.
Failures: debugging::medium (7/20 failed), including several bugfix and logic errors.
Pass rate: speed::medium (10/21 passed, 47.62%).
373 tests were executed.

Failure notes
… easy and medium levels, including debugging_easy_02_fix_append, debugging_easy_04_fix_div_zero_guard, and debugging_medium_13_bugfix (very high latency: 92.30s).
… reasoning_medium_01_weighted_average, despite strong performance elsewhere.
speed::medium had 11 failures, suggesting challenges with mid-complexity timing-sensitive tasks.

Per-model aggregates from overall_ranking.json for this run id.
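The seven per-category pass tallies quoted in the summary can be cross-checked against the overall 321/373 (86.06%) figure. The sketch below is only an arithmetic check on the numbers printed in this report; the names `category_tallies` and `pass_rate` are illustrative and are not part of blxbench.

```python
# (passed, total) per category, in the order the summary lists them.
# Values are copied from the report text; nothing is read from report.json.
category_tallies = [
    (60, 60),
    (60, 62),
    (57, 60),
    (6, 6),
    (53, 65),
    (44, 60),
    (41, 60),
]


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to two decimals."""
    return round(100.0 * passed / total, 2)


total_passed = sum(passed for passed, _ in category_tallies)
total_tests = sum(total for _, total in category_tallies)
overall = pass_rate(total_passed, total_tests)

print(total_passed, total_tests, overall)  # prints: 321 373 86.06
```

The same helper reproduces the scope-level figure for speed::medium: `pass_rate(10, 21)` gives 47.62.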
Values are read from report.json when the benchmark wrote them.
Test suite
v1 — Nutrition
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
Not recorded (older report.json)
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
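A minimal sketch of how a mean latency built from "successful timings only" might be computed: rows without a recorded timing are skipped instead of being counted as zero. The row layout and the field names `category` and `latency_s` are assumptions made for illustration, and the sample rows are invented; the actual report.json schema may differ.

```python
from statistics import mean

# Illustrative rows only; these are NOT values from this run's report.json.
rows = [
    {"category": "reasoning", "latency_s": 3.0},
    {"category": "reasoning", "latency_s": 5.0},
    {"category": "reasoning", "latency_s": None},  # no timing recorded
    {"category": "speed", "latency_s": 8.0},
]


def mean_latency_by_category(rows):
    """Average latency per category, ignoring rows with no timing."""
    buckets = {}
    for row in rows:
        latency = row.get("latency_s")
        if latency is None:
            continue  # successful timings only
        buckets.setdefault(row["category"], []).append(latency)
    return {category: mean(values) for category, values in buckets.items()}


latencies = mean_latency_by_category(rows)
print(latencies)  # prints: {'reasoning': 4.0, 'speed': 8.0}
```

Skipping missing timings keeps a category's mean from being pulled toward zero by rows that never produced a measurement.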
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
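The grouping and cost rules described for this table can be sketched as follows. The keys `results`, `cost_usd`, and `usage.cost` come from the description above; the keys `id`, `category`, and `difficulty`, the helper names, and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def row_cost(row):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")


def group_rows(results):
    """Group rows by category, then difficulty, keeping execution order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Invented sample data; not taken from this run.
report = {
    "results": [
        {"id": "a", "category": "debugging", "difficulty": "easy", "cost_usd": 0.01},
        {"id": "b", "category": "debugging", "difficulty": "medium", "usage": {"cost": 0.02}},
        {"id": "c", "category": "speed", "difficulty": "medium"},
        {"id": "d", "category": "debugging", "difficulty": "easy", "cost_usd": 0.03},
    ]
}

grouped = group_rows(report["results"])
easy_ids = [row["id"] for row in grouped["debugging"]["easy"]]
print(easy_ids)  # prints: ['a', 'd']
```

Because rows are appended in the order they appear in `results`, each difficulty table keeps the benchmark execution order without a separate sort.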