BLXBench - Run run

Benchmark run

Started May 1, 2026, 8:27 AM · Recorded May 1, 2026, 9:24 AM · Ended May 1, 2026, 9:24 AM

Test suite v1 — Nutrition · 17bc604b897e…

Blended score: 85.4 · Tests: 373 · Models: 1

Passed: 321

Failed: 52

Pass rate: 86.1%

Duration: 3422.4s

Categories: 7

Models: 1

Speed avg: 2256.0 t/s

Speed TTFT: 10358ms

Est. cost (run): $0.69

Tokens (Σ results): 62.6k / 264.8k

Submitted by: Bitslix

Run summary

Generated May 1, 2026, 9:25 AM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: x-ai/grok-4.3
Total tests run: 373
Categories included: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Run mode: Full benchmark (no --limit, --category, or --level filters applied)
Fail-fast behavior: Disabled (fail_fast: false)
Results completeness: Partial results shown (results_truncated: true), but aggregate summary is complete.

Performance Summary

Overall pass rate: 321/373 (86.06%)
Total score: 320.53 out of 375 (85.48%)
Average latency: 9.09s
Median latency: 5.21s
Average time-to-first-token (TTFT): 8.20s
Average output speed: 9941.53 tok/s
Total cost: $0.69
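The headline figures can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted in this report; the per-test cost values are derived here and are not fields the benchmark itself reports.

```python
# Figures copied from the summary above.
passed, total = 321, 373
total_cost_usd = 0.69

failed = total - passed
pass_rate = 100 * passed / total

# Derived values (not reported by the benchmark itself).
cost_per_test = total_cost_usd / total
cost_per_pass = total_cost_usd / passed

print(f"failed: {failed}")
print(f"pass rate: {pass_rate:.2f}%")
print(f"cost per test: ${cost_per_test:.5f}")
print(f"cost per passed test: ${cost_per_pass:.5f}")
```

The failed count and pass rate computed this way agree with the 52 failures and 86.06% shown in the report.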

Category-Level Performance

Strong Performers

Hallucination: Perfect pass rate (60/60, 100%) and full score (60/60). Extremely fast output speed (23970.27 tok/s) and low TTFT (2.44s).
Reasoning: High pass rate (60/62, 96.77%) and strong score (60/64, 93.75%). Fast output (25771.19 tok/s) and low TTFT (3.32s).
Security: High pass rate (57/60, 95%) and score (57/60, 95%). Moderate output speed (407.22 tok/s), TTFT (4.65s).
Coding UI: Perfect pass rate (6/6, 100%) and high score (5.535/6, 92.25%). High cost ($0.15) and long TTFT (88.44s) due to complex UI generation tasks.

Moderate Performers

Speed: Pass rate 53/65 (81.54%). Includes speed::medium with very low pass rate (10/21, 47.62%), dragging down category performance.
Debugging: Pass rate 44/60 (73.33%). Performance varies by difficulty: hard (16/20) > easy (15/20) > medium (13/20).
Refactoring: Pass rate 41/60 (68.33%). Consistently lower across difficulty levels, with easy (12/20) being weakest.
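As a consistency check, the per-category counts above can be summed and ranked. This is a minimal sketch using only the pass counts quoted in this section; the dictionary layout is an illustrative choice, not the structure of report.json.

```python
# (passed, total) per category, copied from the category summaries above.
categories = {
    "hallucination": (60, 60),
    "reasoning": (60, 62),
    "security": (57, 60),
    "coding_ui": (6, 6),
    "speed": (53, 65),
    "debugging": (44, 60),
    "refactoring": (41, 60),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# Rank categories from highest to lowest pass rate.
ranked = sorted(categories.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (p, t) in ranked:
    print(f"{name:<14} {p:>3}/{t:<3} {100 * p / t:6.2f}%")
print(f"{'overall':<14} {total_passed}/{total_tests}")
```

The category counts add up to the overall 321/373, and refactoring ranks last by pass rate.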

Notable Observations

Best latency performance: reasoning::easy (3.19s avg TTFT) and hallucination::easy (2.09s avg TTFT).
Highest output speed: hallucination::hard (40630.25 tok/s), indicating rapid factual denial responses.
Costliest category: coding_ui ($0.15) due to high token output in UI generation.
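Largest debugging failure cluster: debugging::medium (7/20 failed), including several bugfix and logic errors. Larger clusters appear in speed::medium (11/21 failed) and refactoring::easy (8/20 failed).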
Lowest pass rate among subcategories: speed::medium (10/21 passed, 47.62%).
No skipped tests: All 373 tests were executed.

Failure Highlights

Debugging failures: Multiple failures in easy and medium levels, including debugging_easy_02_fix_append, debugging_easy_04_fix_div_zero_guard, and debugging_medium_13_bugfix (very high latency: 92.30s).
Reasoning failures: Two failures (60/62 passed), one of which is reasoning_medium_01_weighted_average, despite strong performance elsewhere.
Speed issues: speed::medium had 11 failures, suggesting challenges with mid-complexity timing-sensitive tasks.
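Failure counts for the subcategories quoted in this summary follow from total minus passed. The sketch below covers only those subcategories; the rest are not listed in this summary, so they are left out rather than guessed.

```python
# (passed, total) for the subcategories quoted in this summary only.
quoted = {
    "speed::medium": (10, 21),
    "refactoring::easy": (12, 20),
    "debugging::easy": (15, 20),
    "debugging::medium": (13, 20),
    "debugging::hard": (16, 20),
}

failures = {name: total - passed for name, (passed, total) in quoted.items()}

# Sort from most to fewest failures.
by_failures = sorted(failures.items(), key=lambda kv: kv[1], reverse=True)
for name, failed in by_failures:
    print(f"{name:<20} {failed} failed")
```

Among the quoted subcategories, speed::medium has the most failures (11), followed by refactoring::easy (8) and debugging::medium (7).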

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost

x-ai/grok-4.3 | 373 | 321/373 | 85.5% | 9.09s | $0.69

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v1 — Nutrition

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

Not recorded (older report.json)

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.
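One way an inverted, normalized TTFT axis could be computed is min-max scaling with the direction flipped, so that lower TTFT maps to a higher value. The page does not state which normalization the chart uses, so min-max is an assumption here, and `inverted_min_max` is a hypothetical helper. The TTFT values are the per-category figures quoted in the run summary.

```python
def inverted_min_max(values):
    """Scale values to [0, 1] so the smallest input maps to 1.0 (assumed method)."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # All inputs equal: treat every category as best.
        return {name: 1.0 for name in values}
    return {name: (hi - v) / span for name, v in values.items()}

# Average TTFT in seconds for the categories where the summary quotes one.
ttft_s = {
    "hallucination": 2.44,
    "reasoning": 3.32,
    "security": 4.65,
    "coding_ui": 88.44,
}

ttft_score = inverted_min_max(ttft_s)
for name, score in ttft_score.items():
    print(f"{name:<14} {score:.3f}")
```

With this scaling the very long coding_ui TTFT compresses the other three categories into a narrow band near 1.0, which is worth keeping in mind when reading a radar built this way.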

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
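The cost fallback described above (cost_usd, else usage.cost) could be sketched as follows. Only the `cost_usd` and `usage.cost` names come from this page; the `category` key, the helper names, and the sample rows are assumptions for illustration, and the sample values are made up rather than taken from this run.

```python
from collections import defaultdict

def row_cost(row):
    """Return per-task USD from cost_usd, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")

def cost_by_category(rows):
    """Sum recorded costs per category, skipping rows with no cost."""
    totals = defaultdict(float)
    for row in rows:
        cost = row_cost(row)
        if cost is not None:
            totals[row["category"]] += cost  # "category" key is an assumption
    return dict(totals)

# Placeholder rows with made-up values, not taken from this run.
rows = [
    {"category": "cat_a", "cost_usd": 0.5},
    {"category": "cat_a", "usage": {"cost": 0.25}},
    {"category": "cat_b"},  # no cost recorded, so it is skipped
]
print(cost_by_category(rows))
```

Rows without a recorded cost are skipped rather than counted as zero, so a category with no cost data does not appear in the result at all.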

373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)