Benchmark run
Started Apr 25, 2026, 9:21 PM · Recorded Apr 25, 2026, 10:23 PM · Ended Apr 25, 2026, 10:20 PM
Per-model aggregates from overall_ranking.json for this run id.
No matching report.json was found under results/, so charts use ranking or summary data only.
Values are read from report.json when the benchmark wrote them.
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
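A minimal sketch of how a "successful timings only" mean latency could be computed from per-test rows. The field names `success` and `latency_s` are assumptions made for illustration; the actual report.json schema is not shown on this page.

```python
from statistics import mean


def mean_successful_latency(rows):
    """Return the mean latency in seconds over successful rows, or None.

    `rows` is a list of dicts. The keys "success" and "latency_s" are
    hypothetical names, not confirmed fields of report.json.
    """
    samples = [
        row["latency_s"]
        for row in rows
        if row.get("success") and row.get("latency_s") is not None
    ]
    # With no usable samples there is nothing to plot for this model.
    return mean(samples) if samples else None
```

Returning None rather than 0 keeps a model with no successful timings off the latency axis instead of showing it as infinitely fast.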
Normalized TTFT (inverted) vs decode tok/s per category for this run.
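One plausible reading of "normalized TTFT (inverted)" is a min-max scaling flipped so that a lower time to first token gives a higher value. The sketch below assumes that reading; the dashboard's actual normalization is not stated here, and the function name is hypothetical.

```python
def inverted_normalized_ttft(ttft_by_category):
    """Map each category's TTFT (seconds) to [0, 1], where 1 is the fastest.

    Assumes min-max scaling across categories; this is an illustrative
    choice, not necessarily what the dashboard does.
    """
    values = list(ttft_by_category.values())
    lo, hi = min(values), max(values)
    if hi == lo:
        # All categories tie, so give every one the top score.
        return {name: 1.0 for name in ttft_by_category}
    return {
        name: (hi - ttft) / (hi - lo)
        for name, ttft in ttft_by_category.items()
    }
```

Inverting puts TTFT on the same "higher is better" footing as decode tok/s, so both axes can be read in the same direction.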
Per-test rows from report.json results, grouped by category (collapsed by default), then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score.
366 tasks in 6 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
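The grouping described above can be sketched as a two-level dictionary that keeps rows in their original order. The keys `category` and `difficulty` are assumed field names, and groups appear in first-seen order here; the dashboard may order difficulty levels differently.

```python
def group_rows(rows):
    """Group rows by category, then difficulty, preserving input order.

    Python dicts keep insertion order, so rows inside each table stay in
    the order they were read (benchmark execution order).
    "category" and "difficulty" are hypothetical key names.
    """
    grouped = {}
    for row in rows:
        by_difficulty = grouped.setdefault(row["category"], {})
        by_difficulty.setdefault(row["difficulty"], []).append(row)
    return grouped
```

For example, `group_rows(rows)["math"]["easy"]` would hold the easy math rows in the order the benchmark ran them, assuming such a category and difficulty exist in the run.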