BLXBench - Run run

Benchmark run

Started May 29, 2026, 8:27 PM · Recorded May 29, 2026, 10:01 PM · Ended May 29, 2026, 10:01 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 76.5 · Tests: 459 · Models: 1

Passed: 234
Failed: 225
Pass rate: 51.0%
Duration: 5622.7s
Categories: 9
Models: 1
Speed avg: 217.6 t/s
Speed TTFT: 680ms
Cost/strict: $0.0034
Strict success: 90.0%
Score/$: 28737.93
Failed spend: $0.02
P50 task cost: $0.0027
P90 task cost: $0.0052
Est. cost (run): $2.01
Tokens (Σ results): 83.0k / 967.2k
Submitted by: Bitslix

Run summary

Generated May 29, 2026, 10:01 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: x-ai/grok-build-0.1
Total tests run: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Test levels: All levels (easy, medium, hard) were included across categories.
Run mode: Full benchmark (no --limit or --category restriction). Fail-fast was disabled.
Results completeness: The results are truncated — not all test outcomes are included in this payload.

Performance Summary

Overall pass rate: 234/459 (50.98%)
Score: 2297.26 out of 3155 (72.81%)
Total cost: $2.01
Latency:
- Average: 12.09s
- Median: 8.59s
Speed:
- Average TTFT (Time to First Token): 0.73s
- Average output speed: 190.86 tok/s
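As a quick sanity check, the two headline percentages can be recomputed from the raw counts quoted above. This is a minimal sketch; the helper name `percent` is made up for illustration and is not part of blxbench.

```python
def percent(part: float, whole: float, digits: int = 2) -> float:
    """Return part/whole as a percentage rounded to `digits` places."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, digits)


# Figures copied from the summary above.
passed, total_tests = 234, 459
score, max_score = 2297.26, 3155

pass_rate = percent(passed, total_tests)   # overall pass rate in %
score_pct = percent(score, max_score)      # score as % of the maximum
failed = total_tests - passed              # tests that did not pass

print(f"pass rate: {pass_rate}%")
print(f"score: {score_pct}%")
print(f"failed: {failed}")
```

The results agree with the reported 50.98% pass rate, 72.81% score, and 225 failed tests.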

Category-Level Performance

High-Performing Categories

Speed: 91.67% pass rate, 97.22% score — excels across all levels, especially hard (100% pass).
Coding: 90% pass rate, 95.40% score — strong in easy and hard, dips slightly in medium (80%).
Cost: 90% pass rate, 87.33% score — consistent performance, perfect on medium difficulty.

Low-Performing Categories

Refactoring: Only 8.33% pass rate, 59.89% score — weakest category, fails nearly all easy and hard tests.
Reasoning: 18.33% pass rate, 72.27% score — struggles across all levels, despite moderate scoring.
Hallucination: 48.33% pass rate, 70.83% score — inconsistent, particularly weak on easy (30% pass).
Debugging: 51.67% pass rate, 73.88% score — moderate performance, but fails many subtle bugs.

Security

Pass rate: 26.67%, Score: 63.09%
Weak across all levels, with easy at 20% pass — indicates issues in identifying security flaws.

UI

Pass rate: 66.67% (6/9)
Mixed results, with hard level at only 50% pass (1/2).

Notable Observations

Cost efficiency: Despite moderate pass rates, the model generates long outputs: 967,219 completion tokens vs 82,967 prompt tokens, roughly 11.7 completion tokens per prompt token.
High-latency outliers: Some tests (e.g., coding-hard-scheduler) took over 78s, pulling the average latency (12.09s) above the median (8.59s).
TTFT variation: Fastest average TTFT in reasoning (0.59s), slowest in debugging (0.89s).
Output speed: Highest in speed category (217.57 tok/s), lowest in refactoring (200.76 tok/s), though all categories are relatively close.
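The verbosity and cost observations above can be made concrete with a little arithmetic. The sketch below uses only totals reported on this page; the variable names are illustrative, and the per-test means are simple run-level averages, not values taken from report.json.

```python
# Totals reported for this run.
prompt_tokens = 82_967
completion_tokens = 967_219
run_cost_usd = 2.01
total_tests = 459

# Completion tokens produced per prompt token consumed.
ratio = completion_tokens / prompt_tokens

# Simple per-test means over the whole run.
mean_completion_tokens = completion_tokens / total_tests
mean_cost_usd = run_cost_usd / total_tests

print(f"completion/prompt ratio: {ratio:.2f}")
print(f"mean completion tokens per test: {mean_completion_tokens:.0f}")
print(f"mean cost per test: ${mean_cost_usd:.4f}")
```

The mean cost per test comes out near $0.0044, which sits between the reported P50 ($0.0027) and P90 ($0.0052) task costs. That would be consistent with a minority of expensive tasks pulling the mean above the median, though the per-task distribution is not shown here.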

Failures & Errors

Critical failure areas:
- refactoring::hard: 0% pass rate (0/20)
- reasoning::easy: Only 15% pass rate (3/20)
- hallucination::easy: 30% pass rate, including multiple score:1 or score:2 results
Notable hallucination failures:
- halluc-edge-integer-overflow: Scored only 1/6, latency 38s — possible confusion on edge behavior.
- halluc-nested-merge-claims: Scored 0/6, very short output (52 tokens) — likely failed to engage with task.
Debugging blind spots:
- Struggles with race conditions, microtask ordering, and closure bugs — common in async contexts.

Conclusion

The x-ai/grok-build-0.1 model demonstrates strong coding and speed optimization skills, with excellent performance on well-defined algorithmic and performance tasks. However, it struggles significantly with refactoring, reasoning under constraints, and avoiding hallucinations, especially on easier prompts. Its high completion token usage suggests verbose outputs, which may hurt cost efficiency in production. Improvements are needed in semantic reasoning, code safety, and precision on complex or subtle tasks.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                 Tests  Pass     Score  Latency  Cost
x-ai/grok-build-0.1   459    234/459  72.8%  12.09s   $2.01

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.
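The page does not say how TTFT is normalized for the radar. One plausible reading is an inverted min-max scaling, so that the fastest category maps to 1.0 and the slowest to 0.0. The sketch below assumes exactly that; the function name `inverted_minmax` is hypothetical, and only the two per-category TTFT values quoted in the run summary are used.

```python
def inverted_minmax(values: dict[str, float]) -> dict[str, float]:
    """Scale values to [0, 1] so that the smallest input maps to 1.0.

    ASSUMPTION: inverted min-max scaling; the dashboard's real formula
    is not documented on this page.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All categories identical: treat every one as "best".
        return {name: 1.0 for name in values}
    return {name: (hi - v) / (hi - lo) for name, v in values.items()}


# Average TTFT in seconds, as quoted in the run summary
# (other categories are not listed there).
ttft_seconds = {"reasoning": 0.59, "debugging": 0.89}

normalized = inverted_minmax(ttft_seconds)
print(normalized)
```

Under this assumption a lower TTFT gives a larger radar value, which is what "inverted" suggests.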

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
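The grouping and cost rules described for this table can be sketched roughly as follows. The `cost_usd` and `usage.cost` fields are named on this page; the `category` and `level` keys, the function names, and the sample rows are assumptions made up for illustration and may not match the real report.json schema.

```python
from collections import defaultdict
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None  # cost was not recorded for this task


def group_rows(results: list[dict[str, Any]]) -> dict[str, dict[str, list]]:
    """Group rows by category, then difficulty, keeping execution order."""
    grouped: dict[str, dict[str, list]] = defaultdict(lambda: defaultdict(list))
    for row in results:
        # ASSUMPTION: rows expose "category" and "level" keys.
        grouped[row["category"]][row["level"]].append(row)
    return grouped


# Made-up rows, not real results from this run.
sample = [
    {"id": "example-1", "category": "coding", "level": "easy", "cost_usd": 0.003},
    {"id": "example-2", "category": "coding", "level": "hard", "usage": {"cost": 0.005}},
    {"id": "example-3", "category": "coding", "level": "easy"},
]

grouped = group_rows(sample)
easy_ids = [row["id"] for row in grouped["coding"]["easy"]]
costs = [task_cost(row) for row in sample]
print(easy_ids, costs)
```

Appending to plain lists keeps rows in the order they appear in the input, which mirrors the execution-order guarantee stated above.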