BLXBench - Run run

Benchmark run

Started Apr 27, 2026, 3:04 PM · Recorded Apr 27, 2026, 4:37 PM · Ended Apr 27, 2026, 4:37 PM

Test suite v1 — Nutrition · 15790b59b787…

Blended score: 55.1 · Tests: 732 · Models: 2

Passed: 403

Failed: 329

Pass rate: 55.1%

Duration: 5529.9 s

Categories: 6

Models: 2

Speed avg: 1370.9 t/s

Speed TTFT: 7985 ms

Est. cost (run): $8.96

Tokens (Σ results): 32.5k / 75.1k

Submitted by: Bitslix

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model               Tests  Pass     Score  Latency  Cost
openai/gpt-5.5      366    240/366  65.6%  3.91s    $1.08
openai/gpt-5.5-pro  366    163/366  44.5%  11.55s   $7.88
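
As a quick consistency check, the per-model rows above can be summed to reproduce the run-level figures (tests, passed, failed, pass rate, estimated cost). The sketch below hard-codes the numbers shown on this page; it does not read overall_ranking.json, and the tuple layout is only an illustrative assumption, not the file's actual schema.

```python
# Per-model figures copied from the table on this page:
# (model, tests, passed, cost_usd). The tuple layout is illustrative only.
rows = [
    ("openai/gpt-5.5", 366, 240, 1.08),
    ("openai/gpt-5.5-pro", 366, 163, 7.88),
]

total_tests = sum(r[1] for r in rows)
total_passed = sum(r[2] for r in rows)
total_failed = total_tests - total_passed
total_cost = round(sum(r[3] for r in rows), 2)

# Run-level pass rate, in percent, rounded to one decimal place.
pass_rate = round(100 * total_passed / total_tests, 1)

# Per-model score is passed / tests for that model.
scores = {model: round(100 * passed / tests, 1) for model, tests, passed, _ in rows}

print(scores)
print(total_tests, total_passed, total_failed, pass_rate, total_cost)
```

With the values above, the totals match the summary tiles: 732 tests, 403 passed, 329 failed, a 55.1% pass rate, and $8.96 estimated cost.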

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v1 — Nutrition

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

Not recorded (older report.json)

Resumed run

Model comparison (score %)

Overall score per model for this run (from overall_ranking run_models). Shown when more than one model participated.

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

732 tasks in 6 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
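
The per-task cost rule described above (cost_usd or usage.cost, when recorded) could be sketched as follows. This is a minimal illustration under assumptions: the helper name `task_cost` is hypothetical, the sample rows are made up, the exact nesting of `usage.cost` inside a result row is assumed, and checking `cost_usd` first is a guess at the precedence rather than documented behavior.

```python
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task USD cost, or None when neither field was recorded.

    Assumption: cost_usd is a top-level key and usage is a nested dict
    with a cost key. The precedence (cost_usd first) is also assumed.
    """
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return float(cost) if cost is not None else None


# Hypothetical result rows, not taken from this run's report.json.
sample_rows = [
    {"id": "t1", "cost_usd": 0.004},
    {"id": "t2", "usage": {"cost": 0.021}},
    {"id": "t3"},  # no cost recorded
]

costs = [task_cost(r) for r in sample_rows]
recorded = [c for c in costs if c is not None]
print(costs)
print(round(sum(recorded), 3))
```

Returning None instead of 0.0 for rows without a recorded cost keeps "not recorded" distinct from "free", so missing rows are not silently counted as zero spend.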