Benchmark run
Started Apr 30, 2026, 9:50 AM · Recorded Apr 30, 2026, 10:50 AM · Ended Apr 30, 2026, 10:50 AM
Test suite v1 — Nutrition · 17bc604b897e…
Generated Apr 30, 2026, 10:50 AM · qwen/qwen3-235b-a22b-2507
z-ai/glm-5.1
373
coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
… --limit, --category, or --level specified)
… fail_fast: false)
true (some test outcomes may be missing from this payload)
52/373 (13.94%)
52 out of a maximum possible 375 (13.87%)
$0.396
17,699 prompt + 90,593 completion = 108,292 total
8.23s
4.25s
10.75s
14/60 (23.33%) pass rate. Stronger on hard (6/20) and medium (6/20) levels.
12/60 (20.00%) pass rate. Best on easy (6/20), drops to 3/20 on hard and medium.
10/60 (16.67%) pass rate. Peaks on medium rather than rising steadily with difficulty: easy (2/20), medium (5/20), hard (3/20).
8/65 (12.31%) pass rate. Best on easy (5/21), worst on medium (0/21).
5/62 (8.06%) pass rate. Very weak across all levels, with 1 pass on hard (1/20) and 3 on medium (3/20).
3/60 (5.00%) pass rate. Consistently low across all levels (1/20 each).
0/6 (0.00%) pass rate. All tests failed or timed out. Three medium-level tests timed out (thunderstorm_over_city, underwater_coral_reef, vinyl_record_player).
… 10.75s) exceeds average latency (8.23s), suggesting many responses had delays before generation began.
… 1,169.29 tok/s) but likely skewed by a few very fast, short completions. Some categories like hallucination show very high speeds (4,524.83 tok/s) due to short outputs.
… coding_ui ($0.088), driven by long outputs (e.g., breakout_game: 8,346 tokens).
… hallucination ($0.017), due to short responses.
refactoring::hard has an average TTFT of 57.32s, significantly higher than other categories.
speed::medium had 0 passes (0/21), indicating poor performance on moderately complex speed tasks.
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
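The headline totals in the summary above can be cross-checked from the per-category counts. The sketch below is only an arithmetic check, not the benchmark's own aggregation code. The helper name pass_rate is hypothetical, and category names are left out because the extracted summary does not pair names with these rows.

```python
# (passed, total) pairs in the order the per-category lines appear in the summary.
# Category names are intentionally omitted: the summary text does not pair them.
category_counts = [(14, 60), (12, 60), (10, 60), (8, 65), (5, 62), (3, 60), (0, 6)]


def pass_rate(passed, total):
    """Return the pass rate as a percentage rounded to two decimals (hypothetical helper)."""
    return round(100.0 * passed / total, 2)


total_passed = sum(passed for passed, _ in category_counts)
total_tests = sum(total for _, total in category_counts)
overall_rate = pass_rate(total_passed, total_tests)

# Token totals as reported: prompt + completion.
total_tokens = 17_699 + 90_593

print(total_passed, total_tests, overall_rate, total_tokens)
```

The per-category sums reproduce the reported 52/373 (13.94%) and the 108,292-token total, so the summary figures are internally consistent.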
Test suite
v1 — Nutrition
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
Not recorded (older report.json)
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
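The grouping and cost lookup described for the per-test table can be sketched as follows. This is a minimal illustration, not the report's actual rendering code: the row keys "category" and "level" and the placement of cost_usd and usage on each row are assumptions, the function names are hypothetical, and the sample rows are made up rather than taken from this run.

```python
from collections import defaultdict


def row_cost(row):
    """Return per-task USD from cost_usd, falling back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")


def group_rows(results):
    """Group rows by (category, level), keeping input order within each group.

    The "category" and "level" key names are assumptions about report.json.
    """
    groups = defaultdict(list)
    for row in results:
        groups[(row.get("category"), row.get("level"))].append(row)
    return dict(groups)


# Made-up rows for illustration only; not data from this run.
sample = [
    {"category": "speed", "level": "easy", "cost_usd": 0.001},
    {"category": "speed", "level": "easy", "usage": {"cost": 0.002}},
    {"category": "security", "level": "hard"},
]

grouped = group_rows(sample)
costs = [row_cost(row) for row in sample]
print(list(grouped.keys()), costs)
```

Because rows are appended in input order, each group keeps the benchmark execution order, and a row with neither cost field yields None instead of a guessed value.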