BLXBench - Benchmark run

Started May 20, 2026, 7:35 PM · Recorded May 20, 2026, 9:00 PM · Ended May 20, 2026, 9:00 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 76.4
Tests: 459
Models: 1
Passed: 235
Failed: 224
Pass rate: 51.2%
Duration: 5125.0 s
Categories: 9
Speed avg: 251.0 t/s
Speed TTFT: 1193 ms
Cost/strict: $0.0030
Strict success: 100.0%
Score/$: 30590.05
Failed spend: $0.00
P50 task cost: $0.0024
P90 task cost: $0.0043
Est. cost (run): $1.97
Tokens (Σ results): 82.8k prompt / 900.1k completion
Submitted by: Bitslix

Run summary

Generated May 20, 2026, 9:01 PM · qwen/qwen3-235b-a22b-2507

Scope

Single model tested: x-ai/grok-build-0.1
Total tests run: 459
Categories included: All categories were tested (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --category filter applied)
Fail-fast behavior: Disabled (fail_fast: false)
Results completeness: The results_truncated: true flag indicates the full output was truncated, but aggregate metrics are complete.
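
A consumer of the report may want to check these flags before relying on per-test rows. The sketch below is illustrative only: the flag names fail_fast and results_truncated come from the summary above, but their placement as top-level keys of report.json and the helper name per_test_rows_complete are assumptions.

```python
import json

# Hypothetical excerpt. Only the two flag names are taken from the run summary;
# treating them as top-level keys of report.json is an assumption.
raw = '{"fail_fast": false, "results_truncated": true}'
report = json.loads(raw)


def per_test_rows_complete(report: dict) -> bool:
    """Return True only when the report does not mark its results as truncated."""
    return not report.get("results_truncated", False)


if not per_test_rows_complete(report):
    # Aggregates are described as complete, so fall back to them.
    print("Per-test rows are truncated; use aggregate metrics only.")
```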

Performance Summary

Overall pass rate: 235/459 (51.2%)
Score achieved: 2289.88 out of 3155 (72.58%)
Average latency: 10.61s
Median latency: 7.67s
Average time to first token (TTFT): 1.34s
Average output speed: 217.81 tok/s
Total cost: $1.97
Total tokens processed: 82,848 prompt + 900,085 completion = 982,933 tokens
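
The percentages and totals above can be re-derived from the raw counts. A minimal sketch, using only figures quoted in this summary; the mean cost per test is computed here for illustration and is not a value the report states.

```python
# Figures copied from the Performance Summary.
passed, total_tests = 235, 459
score, max_score = 2289.88, 3155
prompt_tokens, completion_tokens = 82_848, 900_085
total_cost_usd = 1.97

pass_rate = round(100 * passed / total_tests, 1)   # percent of tests passed
score_pct = round(100 * score / max_score, 2)      # percent of available score
total_tokens = prompt_tokens + completion_tokens

# Derived here for illustration; not reported in the summary.
mean_cost_per_test = total_cost_usd / total_tests

print(pass_rate, score_pct, total_tokens)
```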

Category-Level Performance

High-Performing Categories

Speed: Exceptional performance with 55/60 passed (91.7% pass rate) and 97.22% score. Strong across all difficulty levels, achieving 100% on hard tests.
Cost: Perfect pass rate (30/30, 100%) and high score (92.67%). All sub-levels achieved 92–94% scores.
Coding: Strong overall with 52/60 passed (86.7% pass rate) and 93.49% score. Performance degrades slightly with difficulty but remains high even on hard tasks (90% pass rate).
UI: High pass rate (7/9, 77.8%) but a comparatively low score (65.36%), based on only 9 tests.

Low-Performing Categories

Refactoring: Very poor performance with only 5/60 passed (8.3% pass rate) and 60.07% score. Struggles across all difficulty levels, with pass rates of 10% (easy), 10% (medium), and 5% (hard).
Security: Low pass rate (11/60, 18.3%) and score (61.33%). Pass rate rises with difficulty but remains weak at every level: 10% on easy, 20% on medium, and 25% on hard.
Reasoning: Very low pass rate (15/60, 25%) and 71.16% score. Pass rate does not track difficulty: 25% (easy), 20% (medium), 30% (hard).
Debugging: Mixed results with 28/60 passed (46.7% pass rate) and 74.71% score. Performance is consistent across difficulty levels (45–50% pass rate).
Hallucination: Moderate pass rate (32/60, 53.3%) and 70.28% score. Pass rate varies by difficulty level, with hard tests (65% pass) outperforming easy (45%) and medium (50%).
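
The category figures above can be cross-checked against the run totals. A minimal sketch: the pass counts and per-level percentages are copied from the category summaries, while the even split of 20 tests per difficulty level in a 60-test category is an assumption, not something the report states.

```python
# (passed, total) per category, copied from the category summaries.
categories = {
    "speed": (55, 60),
    "cost": (30, 30),
    "coding": (52, 60),
    "ui": (7, 9),
    "refactoring": (5, 60),
    "security": (11, 60),
    "reasoning": (15, 60),
    "debugging": (28, 60),
    "hallucination": (32, 60),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# Rank categories from weakest to strongest pass rate.
ranked = sorted(categories, key=lambda n: categories[n][0] / categories[n][1])

# Per-level pass rates in percent (easy, medium, hard), as quoted.
level_pct = {
    "refactoring": (10, 10, 5),
    "security": (10, 20, 25),
    "reasoning": (25, 20, 30),
    "hallucination": (45, 50, 65),
}
TESTS_PER_LEVEL = 20  # assumption: 60 tests split evenly over three levels

# Passes implied by the per-level rates; should match the category totals.
implied = {n: sum(p * TESTS_PER_LEVEL // 100 for p in pcts)
           for n, pcts in level_pct.items()}

print(total_passed, total_tests, ranked[0], ranked[-1])
```

Under the even-split assumption, the passes implied by the per-level rates agree with each category's reported total, and the nine categories sum to the run's 235/459.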

Notable Observations

Latency outliers: The mean latency (10.61s) sits well above the median (7.67s), indicating a long tail of slow responses. The debugging::hard category has the highest average TTFT at 2.32s.
Cost concentration: Debugging is reported as the costliest category ($0.188 across its 60 tests), driven by high completion token usage (92,814 tokens).
Extreme cost per failure: Several hallucination test failures (halluc-nested-merge-claims, halluc-parser-output-claims, halluc-bug-type-coercion) show cost_usd: 0.0495 with very low token counts (47–52 tokens). That is more than ten times the P90 task cost ($0.0043), which suggests a billing error or reporting anomaly.
Strong category dichotomy: The model excels at speed and cost tasks regardless of difficulty but struggles with refactoring and security across all levels.
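
The cost anomaly noted above could be surfaced automatically by comparing each row's cost per token with the run-wide average. A minimal sketch under stated assumptions: the row ids and values are illustrative placeholders rather than actual results, the tokens field name and the 50x threshold are hypothetical, and only cost_usd and the run totals come from the report.

```python
# Run-wide totals quoted in the Performance Summary.
RUN_COST_USD = 1.97
RUN_TOKENS = 982_933
BASELINE_USD_PER_TOKEN = RUN_COST_USD / RUN_TOKENS


def flag_cost_anomalies(rows, factor=50.0):
    """Return ids of rows whose cost per token exceeds factor x the run average."""
    flagged = []
    for row in rows:
        tokens = row["tokens"]
        if tokens <= 0:
            continue  # cannot compute a per-token cost
        if row["cost_usd"] / tokens > factor * BASELINE_USD_PER_TOKEN:
            flagged.append(row["id"])
    return flagged


# Illustrative rows only, not real results from this run.
rows = [
    {"id": "example-normal", "tokens": 2000, "cost_usd": 0.0040},
    {"id": "example-suspect", "tokens": 50, "cost_usd": 0.0495},
]

print(flag_cost_anomalies(rows))
```

A ratio against the run average is used rather than a fixed dollar cutoff so that the check scales with model pricing; the threshold itself would need tuning on real data.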

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                 Tests  Pass     Score  Latency  Cost
x-ai/grok-build-0.1   459    235/459  72.6%  10.61s   $1.97

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results, grouped by category (collapsed by default), then by difficulty.
COMPL: taken from details when present.
Visual: column omitted when no test in this run has a details.visual score.
Judge: verdict and overall score (0–100) from judge_validation / validation_model for coding/UI tests (hover for summary and subscores).
Cost: per-task USD from cost_usd or usage.cost when recorded.
Suite: same manifest version/hash for every row in this run.
No HTML or screenshots are included in this table.

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)