BLXBench - Run

Benchmark run

Started May 12, 2026, 3:39 PM · Recorded May 12, 2026, 7:26 PM · Ended May 12, 2026, 7:26 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 71.4 · Tests: 459 · Models: 1

Passed: 216

Failed: 243

Pass rate: 47.1%

Duration: 13654.1s

Categories: 9

Models: 1

Speed avg: 47.9 t/s

Speed TTFT: 5952ms

Cost/strict: $0.00

Strict success: 96.7%

Score/$: n/a

Failed spend: $0.00

P50 task cost: $0.00

P90 task cost: $0.00

Est. cost (run): $0.00

Tokens (Σ results): 26.2k / 318.7k

Submitted by: Bitslix
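The summary figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers shown on this page. The variable names are illustrative and are not part of BLXBench.

```python
# Figures copied from the run summary on this page.
passed = 216
failed = 243
duration_s = 13654.1

# Total tests and pass rate as a percentage.
total = passed + failed
pass_rate = 100 * passed / total

# Break the duration into hours, minutes, and seconds.
hours, rem = divmod(duration_s, 3600)
minutes, seconds = divmod(rem, 60)

print(f"total tests: {total}")
print(f"pass rate: {pass_rate:.1f}%")
print(f"duration: {int(hours)}h {int(minutes)}m {seconds:.0f}s")
```

The total (459) and pass rate (47.1%) match the reported values, and a duration of roughly 3h 47m is consistent with the 3:39 PM start and 7:26 PM end.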

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost

baidu/cobuddy:free | 459 | 216/459 | 68.7% | 21.58s | $0.00

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.3

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.
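The page does not say how the radar axes are normalized. One common approach, shown here purely as an assumption, is min-max scaling across categories, with TTFT flipped so that a lower time to first token scores higher. The function name, category labels, and sample values below are hypothetical and are not taken from this run.

```python
def normalize(values, invert=False):
    """Min-max scale values to [0, 1]; with invert=True, lower raw values score higher."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread across categories: treat every category as equal.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


# Hypothetical per-category samples (not from this run).
categories = ["cat_a", "cat_b", "cat_c"]
ttft_ms = [4000.0, 6000.0, 8000.0]
decode_tps = [40.0, 50.0, 60.0]

ttft_axis = normalize(ttft_ms, invert=True)   # fastest first token scores 1.0
decode_axis = normalize(decode_tps)           # fastest decode scores 1.0

for name, t, d in zip(categories, ttft_axis, decode_axis):
    print(f"{name}: ttft={t:.2f} decode={d:.2f}")
```

With both axes scaled this way, a larger radar area means a category streams both sooner and faster.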

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
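The cost fallback and grouping described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code. The field names cost_usd and usage.cost come from the description above, while the category and difficulty keys, the name key, and the sample rows are assumptions made for the example.

```python
from collections import defaultdict


def task_cost(row):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")


def group_rows(rows):
    """Group by category, then difficulty, keeping the original row order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # "category" and "difficulty" are assumed key names.
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical rows standing in for report.json results.
rows = [
    {"name": "t1", "category": "coding", "difficulty": "easy", "cost_usd": 0.0},
    {"name": "t2", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.0}},
    {"name": "t3", "category": "coding", "difficulty": "easy"},
]

grouped = group_rows(rows)
for category, by_difficulty in grouped.items():
    for difficulty, items in by_difficulty.items():
        names = [r["name"] for r in items]
        costs = [task_cost(r) for r in items]
        print(category, difficulty, names, costs)
```

Rows with no recorded cost yield None rather than 0.0, which keeps "not recorded" distinct from a genuinely free task such as those in this run.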