Benchmark run

run_8e2200

Started Apr 25, 2026, 9:21 PM · Recorded Apr 25, 2026, 10:23 PM · Ended Apr 25, 2026, 10:20 PM

Blended score 27.9 · Tests 366 · Models 1

Passed: 102

Failed: 264

Pass rate: 27.9%

Duration: 3538.0s

Categories: 6

Models: 1

Est. cost (run): $0.12

Submitted by: Bitslix

Tokens (Σ results): 9.3k / 42.1k
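
The pass rate above follows from the passed and failed counts. A minimal Python sketch of that arithmetic, assuming the page rounds to one decimal place; the function name `pass_rate` is hypothetical and not part of blxbench:

```python
def pass_rate(passed: int, failed: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    total = passed + failed
    if total == 0:
        # No tests ran; report 0.0 rather than dividing by zero.
        return 0.0
    return round(100.0 * passed / total, 1)


# Counts taken from the summary above: 102 passed, 264 failed (366 tests).
print(pass_rate(102, 264))  # 27.9
```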

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                     Tests  Pass     Score  Latency  Cost
deepseek/deepseek-v4-pro  366    102/366  27.9%  7.25s    $0.12

No matching report.json under results/ — charts use ranking or summary only.

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.
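
The page does not state how TTFT is normalized or inverted. One common choice, shown here purely as an assumption, is min-max scaling across categories followed by `1 - x`, so that a lower time to first token maps to a larger radar value. The function name and the sample values are made up for illustration:

```python
def inverted_minmax(values):
    """Min-max scale values to [0, 1], then invert so smaller inputs score higher."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All categories have the same TTFT; give each the top score.
        return [1.0 for _ in values]
    return [1.0 - (v - lo) / (hi - lo) for v in values]


# Hypothetical per-category TTFT values in seconds (not taken from this run).
print(inverted_minmax([0.5, 1.0, 1.5]))  # [1.0, 0.5, 0.0]
```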

Tests per category

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score.

366 tasks in 6 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
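
The grouping described above can be sketched in Python as follows. This is an illustration only: the field names `category`, `difficulty`, and `name` are assumptions about the shape of the rows in report.json, and the sample rows are invented. Because Python dicts and lists preserve insertion order, rows within each group stay in benchmark execution order.

```python
from collections import defaultdict


def group_results(results):
    """Group rows by category, then by difficulty, keeping input order in each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        grouped[row["category"]][row["difficulty"]].append(row)
    # Convert to plain dicts so the result is easy to print or serialize.
    return {cat: dict(by_diff) for cat, by_diff in grouped.items()}


# Invented rows, listed in (hypothetical) execution order.
rows = [
    {"name": "t1", "category": "coding", "difficulty": "easy"},
    {"name": "t2", "category": "math", "difficulty": "hard"},
    {"name": "t3", "category": "coding", "difficulty": "easy"},
]
tables = group_results(rows)
print([r["name"] for r in tables["coding"]["easy"]])  # ['t1', 't3']
```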