Benchmark run
Started Apr 27, 2026, 3:04 PM · Recorded Apr 27, 2026, 4:37 PM · Ended Apr 27, 2026, 4:37 PM
Test suite v1 — Nutrition · 15790b59b787…
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v1 — Nutrition
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: Not recorded (older report.json)
Resumed run: No
Overall score per model for this run (from overall_ranking run_models). Shown when more than one model participated.
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
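As an illustration of "successful timings only", here is a minimal sketch of how a mean latency per model could be derived from per-test rows. The row keys `model`, `latency_s`, and `ok` are hypothetical and are not confirmed by this page. The real report.json layout may differ.

```python
from statistics import mean

def mean_latency_by_model(rows):
    """Average latency in seconds per model, skipping failed or untimed rows.

    Assumption: each row is a dict with hypothetical keys
    'model', 'latency_s', and 'ok'; report.json may use other names.
    """
    samples = {}
    for row in rows:
        latency = row.get("latency_s")
        # Keep only successful tests that actually recorded a timing.
        if not row.get("ok") or latency is None:
            continue
        samples.setdefault(row["model"], []).append(float(latency))
    # Models with no usable samples are omitted rather than reported as 0.
    return {model: mean(values) for model, values in samples.items()}

# Made-up rows for illustration only; not data from this run.
example_rows = [
    {"model": "model-a", "latency_s": 2.0, "ok": True},
    {"model": "model-a", "latency_s": 4.0, "ok": True},
    {"model": "model-a", "latency_s": 30.0, "ok": False},
    {"model": "model-b", "latency_s": None, "ok": True},
]
print(mean_latency_by_model(example_rows))
```

Omitting a model that has no successful samples matches the "where samples exist" wording above. Whether the page handles that case the same way is an assumption.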
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
732 tasks in 6 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
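The cost fallback and the grouping described above can be sketched as follows. The field names `cost_usd` and `usage.cost` come from the description. The `category` and `difficulty` keys, and both function names, are hypothetical and only serve the illustration.

```python
def task_cost_usd(row):
    """Per-task cost in USD: prefer cost_usd, else usage.cost, else None."""
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return None if cost is None else float(cost)

def group_rows(rows):
    """Group rows by category, then by difficulty.

    Assumption: rows carry hypothetical 'category' and 'difficulty' keys.
    Rows are appended in input order, so each group keeps the original
    (execution) order of the results list.
    """
    grouped = {}
    for row in rows:
        category = row.get("category", "unknown")
        difficulty = row.get("difficulty", "unknown")
        grouped.setdefault(category, {}).setdefault(difficulty, []).append(row)
    return grouped

# Made-up rows for illustration only; not data from this run.
example_rows = [
    {"id": "t1", "category": "macros", "difficulty": "easy", "cost_usd": 0.01},
    {"id": "t2", "category": "macros", "difficulty": "hard", "usage": {"cost": 0.05}},
    {"id": "t3", "category": "macros", "difficulty": "easy"},
]
grouped = group_rows(example_rows)
print([r["id"] for r in grouped["macros"]["easy"]])
print([task_cost_usd(r) for r in example_rows])
```

Returning `None` when no cost was recorded keeps a missing cost distinct from a recorded cost of zero.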