BLXBench - Run

Benchmark run

Started Apr 30, 2026, 9:50 AM · Recorded Apr 30, 2026, 10:50 AM · Ended Apr 30, 2026, 10:50 AM

Test suite v1 — Nutrition · 17bc604b897e…

Blended score: 13.9 · Tests: 373 · Models: 1

Passed: 52
Failed: 321
Pass rate: 13.9%
Duration: 3590.0s
Categories: 7
Models: 1
Speed avg: 135.5 t/s
Speed TTFT: 15969 ms
Est. cost (run): $0.40
Tokens (Σ results): 17.7k / 90.6k
Submitted by: Bitslix

Run summary

Generated Apr 30, 2026, 10:50 AM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: z-ai/glm-5.1
Total tests: 373
Categories included: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Run mode: Full benchmark (no --limit, --category, or --level specified)
Fail-fast behavior: Disabled (fail_fast: false)
Results truncated: true (some test outcomes may be missing from this payload)

Performance Summary

Overall pass rate: 52/373 (13.94%)
Score: 52 out of a maximum possible 375 (13.87%)
Total cost: $0.396
Total tokens: 17,699 prompt + 90,593 completion = 108,292 total
Latency:
- Average: 8.23s
- Median: 4.25s
- Average TTFT (Time to First Token): 10.75s
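
The headline figures above can be re-derived from the raw counts. The following is a minimal Python sketch using only numbers quoted in this summary. The variable names are illustrative, and the cost-per-1k-tokens figure is a derived value that the report itself does not state. Note that the pass rate and the score use different denominators (373 tests versus a maximum score of 375), which is why the two percentages differ slightly.

```python
# Figures copied from the Performance Summary above.
passed, total_tests = 52, 373
score, max_score = 52, 375
prompt_tokens, completion_tokens = 17_699, 90_593
total_cost_usd = 0.396

pass_rate = 100 * passed / total_tests          # share of tests passed
score_pct = 100 * score / max_score             # share of maximum score
total_tokens = prompt_tokens + completion_tokens
# Derived value, not stated in the report: blended cost per 1,000 tokens.
cost_per_1k_tokens = 1000 * total_cost_usd / total_tokens

print(f"pass rate:    {pass_rate:.2f}%")        # 13.94%
print(f"score:        {score_pct:.2f}%")        # 13.87%
print(f"total tokens: {total_tokens:,}")        # 108,292
print(f"cost per 1k tokens: ${cost_per_1k_tokens:.4f}")
```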

Category-Level Performance

Debugging: Best-performing category with 14/60 (23.33%) pass rate. Stronger on hard (6/20) and medium (6/20) levels.
Security: 12/60 (20.00%) pass rate. Best on easy (6/20), drops to 3/20 on hard and medium.
Hallucination: 10/60 (16.67%) pass rate. Performance does not fall with difficulty: easy is the weakest level (2/20), medium the strongest (5/20), and hard sits in between (3/20).
Speed: 8/65 (12.31%) pass rate. Best on easy (5/21), worst on medium (0/21).
Reasoning: 5/62 (8.06%) pass rate. Very weak across all levels, with 1/20 on hard and 3/20 on medium.
Refactoring: 3/60 (5.00%) pass rate. Consistently low across all levels (1/20 each).
Coding UI: 0/6 (0.00%) pass rate. All tests failed or timed out. Three medium-level tests timed out (thunderstorm_over_city, underwater_coral_reef, vinyl_record_player).
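
As a consistency check, the per-category counts listed above can be summed and ranked. This is a small Python sketch that uses only the pass and test counts quoted in this section. The dictionary keys follow the category identifiers from the Scope list, and the data structure is an illustration, not the benchmark's own format.

```python
# (passed, total) per category, copied from the category list above.
categories = {
    "debugging": (14, 60),
    "security": (12, 60),
    "hallucination": (10, 60),
    "speed": (8, 65),
    "reasoning": (5, 62),
    "refactoring": (3, 60),
    "coding_ui": (0, 6),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# Rank categories by pass rate, highest first.
ranked = sorted(categories.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (p, t) in ranked:
    print(f"{name:<14} {p:>2}/{t:<3} {100 * p / t:6.2f}%")

print(f"overall        {total_passed}/{total_tests}")
```

The totals come out to 52 passes across 373 tests, which matches the overall figures reported for the run.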

Notable Observations

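High TTFT: Average TTFT (10.75s) exceeds average latency (8.23s). This cannot hold for any single request, because TTFT is part of total latency, so the two averages are likely taken over different sets of tests. The figure still suggests that many responses had long delays before generation began.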
Output speed variability: Extremely high average (1,169.29 tok/s) but likely skewed by a few very fast, short completions. Some categories like hallucination show very high speeds (4,524.83 tok/s) due to short outputs.
Cost distribution:
- Highest cost: coding_ui ($0.088), driven by long outputs (e.g., breakout_game: 8,346 tokens).
- Lowest cost: hallucination ($0.017), due to short responses.
Refactoring hard TTFT outlier: refactoring::hard has an average TTFT of 57.32s, significantly higher than other categories.
Speed medium failure: speed::medium had 0 passes (0/21), indicating poor performance on moderately complex speed tasks.
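
The note on output speed variability can be illustrated with a small Python sketch. It shows how one very short, very fast completion can inflate a per-test mean of tok/s while a median or a token-weighted aggregate stays low. The samples are hypothetical and are not taken from this run. Whether this effect explains the gap between the dashboard's 135.5 t/s and the summary's 1,169.29 tok/s is an assumption, since the report does not say how either average is computed.

```python
from statistics import mean, median

# Hypothetical (completion_tokens, decode_seconds) samples, NOT from this run.
samples = [
    (2000, 16.0),  # long completion, 125 tok/s
    (1500, 12.0),  # long completion, 125 tok/s
    (1800, 12.0),  # long completion, 150 tok/s
    (40, 0.01),    # very short answer, 4000 tok/s
]

per_test_speeds = [tokens / seconds for tokens, seconds in samples]

# Mean of per-test rates: every test weighs the same, so the outlier dominates.
mean_of_rates = mean(per_test_speeds)
# Median of per-test rates: robust to a few extreme values.
median_rate = median(per_test_speeds)
# Aggregate rate: total tokens over total time, weighted by completion length.
aggregate_rate = sum(t for t, _ in samples) / sum(s for _, s in samples)

print(f"mean of per-test rates: {mean_of_rates:.1f} tok/s")
print(f"median per-test rate:   {median_rate:.1f} tok/s")
print(f"aggregate rate:         {aggregate_rate:.1f} tok/s")
```

With these made-up samples the per-test mean lands near 1,100 tok/s while the median and the aggregate both stay below 140 tok/s, so the choice of averaging method matters when short completions are mixed with long ones.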

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model          Tests   Pass     Score   Latency   Cost
z-ai/glm-5.1   373     52/373   13.9%   8.23s     $0.40

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v1 — Nutrition

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

Not recorded (older report.json)

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
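
The cost rule described above (per-task USD from cost_usd or usage.cost when recorded) can be sketched as a small fallback lookup in Python. The row layout here is an assumption made for illustration: the field names cost_usd and usage.cost come from the description, but their exact position inside report.json, the category key, and the helper name task_cost_usd are hypothetical.

```python
from typing import Any, Optional


def task_cost_usd(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task cost: cost_usd first, then usage.cost, else None."""
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return float(cost) if cost is not None else None


# Hypothetical rows; the real report.json layout may differ.
rows = [
    {"category": "security", "cost_usd": 0.0012},
    {"category": "security", "usage": {"cost": 0.0008}},
    {"category": "speed"},  # no cost recorded for this task
]

# Sum the recorded costs per category, skipping rows without a cost.
totals: dict[str, float] = {}
for row in rows:
    cost = task_cost_usd(row)
    if cost is not None:
        totals[row["category"]] = totals.get(row["category"], 0.0) + cost

print(totals)
```

Rows with no recorded cost are skipped and not counted as zero, so a category total reflects only the tasks that reported a cost.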

373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)