Benchmark run
Started May 29, 2026, 8:27 PM · Recorded May 29, 2026, 10:01 PM · Ended May 29, 2026, 10:01 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 29, 2026, 10:01 PM · qwen/qwen3-235b-a22b-2507
Model: x-ai/grok-build-0.1
Tests: 459 · Categories: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Difficulty levels (easy, medium, hard) were included across categories. The full suite was run (no --limit or --category restriction). Fail-fast was disabled.

Pass rate: 234/459 (50.98%)
Score: 2297.26 out of 3155 (72.81%)
Cost: $2.01
Timings: 12.09s · 8.59s · 0.73s
Throughput: 190.86 tok/s

91.67% pass rate, 97.22% score: excels across all levels, especially hard (100% pass).
90% pass rate, 95.40% score: strong in easy and hard, dips slightly in medium (80%).
90% pass rate, 87.33% score: consistent performance, perfect on medium difficulty.
8.33% pass rate, 59.89% score: weakest category, fails nearly all easy and hard tests.
18.33% pass rate, 72.27% score: struggles across all levels, despite moderate scoring.
48.33% pass rate, 70.83% score: inconsistent, particularly weak on easy (30% pass).
51.67% pass rate, 73.88% score: moderate performance, but fails many subtle bugs.
26.67% pass rate, 63.09% score: easy at 20% pass, which indicates issues in identifying security flaws.
66.67% (6/9): hard level at only 50% pass (1/2).

967,219 completion tokens vs 82,967 prompt tokens.
coding-hard-scheduler took over 78s, contributing to high average latency.
Fastest in reasoning (0.59s), slowest in debugging (0.89s).
Throughput is highest in the speed category (217.57 tok/s) and lowest in refactoring (200.76 tok/s), though all categories are relatively close.

refactoring::hard: 0% pass rate (0/20)
reasoning::easy: only 15% pass rate (3/20)
hallucination::easy: 30% pass rate, including multiple score:1 or score:2 results
halluc-edge-integer-overflow: scored only 1/6, latency 38s; possible confusion on edge behavior.
halluc-nested-merge-claims: scored 0/6, very short output (52 tokens); likely failed to engage with the task.

The x-ai/grok-build-0.1 model demonstrates strong coding and speed optimization skills, with excellent performance on well-defined algorithmic and performance tasks. However, it struggles significantly with refactoring, reasoning under constraints, and avoiding hallucinations, especially on easier prompts.
Its high completion token usage suggests verbose outputs, which may impact cost-efficiency in production. Improvements are needed in semantic reasoning, code safety, and precision on complex or subtle tasks.
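The headline percentages quoted in this summary can be re-derived from the raw counts it reports. The following is a minimal sketch of that arithmetic; the helper name `percent` and the variable names are illustrative only and are not part of blxbench or of report.json.

```python
def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# Counts as quoted in the run summary.
pass_rate = percent(234, 459)        # passed tests / total tests
score_pct = percent(2297.26, 3155)   # points earned / points available

# Completion tokens generated per prompt token, a rough verbosity indicator.
completion_per_prompt = round(967_219 / 82_967, 2)

print(pass_rate)              # 50.98
print(score_pct)              # 72.81
print(completion_per_prompt)  # 11.66
```

The ratio of roughly 11.66 completion tokens per prompt token is consistent with the summary's remark about verbose outputs.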
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.4
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
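The grouping and cost lookup described for this table can be sketched as follows. This is an assumption-laden illustration, not the dashboard's actual code: the row keys `id`, `category`, and `difficulty` are hypothetical, and only `cost_usd` and `usage.cost` are named in the table description. The sample rows are invented for the example and are not results from this run.

```python
def task_cost(row: dict):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")

def group_rows(results: list) -> dict:
    """Group rows by category, then difficulty, preserving input order.

    Python dicts keep insertion order, so rows inside each group stay in
    the order they appear in `results` (assumed to be execution order).
    """
    grouped = {}
    for row in results:
        by_difficulty = grouped.setdefault(row["category"], {})
        by_difficulty.setdefault(row["difficulty"], []).append(row)
    return grouped

# Invented rows for illustration only (hypothetical keys and values).
sample = [
    {"id": "t1", "category": "coding", "difficulty": "hard", "cost_usd": 0.02},
    {"id": "t2", "category": "coding", "difficulty": "easy", "usage": {"cost": 0.01}},
    {"id": "t3", "category": "coding", "difficulty": "hard"},
]

grouped = group_rows(sample)
hard_ids = [row["id"] for row in grouped["coding"]["hard"]]
costs = [task_cost(row) for row in sample]
```

Rows with neither cost field yield `None`, which a table renderer could show as an empty cell rather than as zero.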