Benchmark run
Started May 11, 2026, 12:08 AM · Recorded May 11, 2026, 1:20 AM · Ended May 11, 2026, 1:20 AM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 11, 2026, 1:21 AM · qwen/qwen3-235b-a22b-2507
anthropic/claude-opus-4.7
456
9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
--limit or --fail-fast triggered)
results_truncated: true), but summary aggregates are complete.
276/456 (60.53%)
2134.22 / 3134 (68.10%)
$8.89
10.01s
8.60s
0.98s
95.14 tok/s
96.67% pass rate and 98.47% score. Performed well across all difficulty levels, achieving 100% on coding::easy.
81.67% pass rate, 83.59% score. Balanced performance across difficulty tiers.
81.67% pass rate, 87.78% score. Strong output speed (101.76 tok/s) and low TTFT (0.83s).
96.67% pass rate, 91.33% score. Nearly perfect on medium and hard subcategories.
13.33% pass rate and 36.78% score. Struggled across all difficulties, especially hard (5% pass rate).
38.33% pass rate, 45.11% score. Performance dropped sharply with difficulty: 60% (easy), 10% (medium), 45% (hard).
31.67% pass rate, 75.60% score. Despite moderate score percentage, actual pass rate is low. Performance inconsistent across constraints.
58.33% pass rate, 77.22% score. Mixed results; better on hard (75% pass) than easy (30% pass).
6/6 passed (100% pass rate), but 3 tests were skipped. Score: 86.98%.
debugging category (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with a runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function". These appear to be model output or tooling errors rather than logical failures.
reasoning, multiple constraint-based tests failed despite moderate scoring, indicating partial correctness but failure to fully satisfy conditions.
refactoring ($2.69), due to long outputs and high token counts (107.7k completion tokens).
coding-hard-json-patch ($0.039).
debugging (avg latency 15.03s), likely due to complex scenarios requiring longer reasoning.
coding category (177.28 tok/s avg), particularly on easy tasks (298.93 tok/s).
100% easy → 95% hard).
60% (easy) to 10% (medium), then partial recovery to 45% (hard).
30% pass on easy, 75% on hard, suggesting easier tasks may trigger more overconfident, incorrect responses.
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
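The two headline ratios in the summary, 276/456 and 2134.22 / 3134, can be re-derived from the raw counts. The following is a minimal Python sketch, assuming the first pair is passed tests over total tests and the second is earned score points over maximum score points. The helper name `pct` is hypothetical and is not part of blxbench or report.json.

```python
def pct(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage, rounded to two decimals."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(100.0 * numerator / denominator, 2)


# Assumption: 276 passed out of 456 executed tests.
pass_rate = pct(276, 456)

# Assumption: 2134.22 score points earned out of a maximum of 3134.
score_rate = pct(2134.22, 3134)

print(f"pass rate: {pass_rate:.2f}%")
print(f"score:     {score_rate:.2f}%")
```

Both results agree with the percentages shown in the summary (60.53% and 68.10%).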
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.3
Resumed run
⏸ Yes — resumed from a paused session
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)