Benchmark run
Started May 10, 2026, 8:56 PM · Recorded May 10, 2026, 10:41 PM · Ended May 10, 2026, 10:41 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 10:42 PM · qwen/qwen3-235b-a22b-2507
deepseek/deepseek-v4-pro
459 tests · 9 categories (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
--limit not set), not fail-fast
242/459 (52.7%) passed · 2315.43 out of 3137 (73.8%) score
13.50s · 8.91s · 1.34s · 48.92 tokens/sec on average · $0.816

91.7% pass rate (55/60), score 96.2%: excels across all difficulty levels, achieving 100% pass rate on easy and hard subcategories.
88.3% pass rate (53/60), score 91.7%: very high success rate, especially on easy (90%) and hard (90%) levels.
100% pass rate (30/30), score 90%: flawless on all cost-related tasks.
50% pass rate (30/60), score 76.2%: struggles on easy and hard levels (50% and 45% pass respectively).
50% pass rate (30/60), score 75.8%: inconsistent, particularly on easy (25% pass) and medium (65%) levels.
30% pass rate (18/60), score 73.2%: poor performance across all levels, with 35% (easy), 25% (medium), and 30% (hard).
15% pass rate (9/60), score 61.9%: weakest category, especially on hard tasks (5% pass).
15% pass rate (9/60), score 60.4%: poor across the board, with only 10% on easy and 20% on hard.
88.9% pass rate (8/9), score 82.6%: strong overall, though only 5/6 (83.3%) on medium difficulty.

debug-prototype-pollution-check-v2, debug-prototype-pollution-prototype-merge-v2) failed with error: Spread syntax requires ...iterable[Symbol.iterator] to be a function, indicating a fundamental misunderstanding or code generation flaw.
30s latency (e.g., debug-cache-invalidation-race-v2 at 37.8s, debug-db-transaction-isolation-v2 at 40.8s).
1417 tokens in coding-hard-scheduler, costing $0.00469.

refactoring::hard: 5% pass (1/20)
security::easy: 10% pass (2/20)
hallucination::easy: 25% pass (5/20)

cost-generation-title-case: 100.5 tokens/sec
debug-rate-limiter-v2: 84.4 tokens/sec

coding-medium-circular-buffer: 4.02s TTFT
reason-constraint-maintenance-window: 8.90s TTFT, an unusually slow time to first token.

Overall, the model performs strongly in coding and cost optimization but shows significant weaknesses in reasoning, refactoring, and security, with some critical failures in debugging tasks involving prototype pollution.
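The headline pass rate and score percentage can be cross-checked against the per-category counts quoted in the summary. The following is a minimal Python sketch of that arithmetic; the variable and function names are illustrative, and the pairs are left unlabeled because the summary does not attach a category name to each line.

```python
# (passed, total) pairs copied from the nine per-category summaries,
# in the order they appear in the report.
category_counts = [
    (55, 60), (53, 60), (30, 30), (30, 60), (30, 60),
    (18, 60), (9, 60), (9, 60), (8, 9),
]


def percent(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage, rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


total_passed = sum(passed for passed, _ in category_counts)
total_tests = sum(total for _, total in category_counts)

# Overall pass rate: should agree with "242/459 (52.7%)".
overall_pass_rate = percent(total_passed, total_tests)

# Overall score: should agree with "2315.43 out of 3137 (73.8%)".
overall_score = percent(2315.43, 3137)

print(total_passed, total_tests, overall_pass_rate, overall_score)
```

The per-category counts sum to the reported totals, so the summary lines and the headline figures are mutually consistent.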
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
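The report does not say how TTFT is normalized for this chart. One plausible reading, offered only as an assumption, is min-max scaling followed by inversion so that a lower time to first token maps to a higher value. The function name and the sample figures below are hypothetical and are not taken from this run.

```python
from typing import Dict


def inverted_minmax(values: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale each value to [0, 1], then invert so smaller inputs score higher.

    Assumption: this is one possible normalization, not necessarily the one
    used to build the chart.
    """
    lowest = min(values.values())
    highest = max(values.values())
    if highest == lowest:
        # All inputs are equal; treat every entry as best-case.
        return {name: 1.0 for name in values}
    span = highest - lowest
    return {name: 1.0 - (value - lowest) / span for name, value in values.items()}


# Hypothetical mean TTFT in seconds per category (illustrative only).
sample_ttft = {"category_a": 1.0, "category_b": 2.0, "category_c": 5.0}
scores = inverted_minmax(sample_ttft)
print(scores)
```

With this scaling the fastest category receives 1.0 and the slowest 0.0, which puts TTFT on the same "higher is better" axis direction as decode tokens/sec.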
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
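The cost fallback and the grouping described above can be sketched in a few lines of Python. The keys `cost_usd` and `usage.cost` come from the description; the `category` and `difficulty` keys, the function names, and the sample rows are assumptions made for illustration and may not match the actual report.json schema.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def task_cost(row: Row) -> Optional[float]:
    """Return per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None  # cost was not recorded for this task


def group_rows(results: List[Row]) -> Dict[Tuple[str, str], List[Row]]:
    """Group rows by (category, difficulty), keeping execution order within each group."""
    grouped: Dict[Tuple[str, str], List[Row]] = defaultdict(list)
    for row in results:
        # "category" and "difficulty" are assumed key names.
        grouped[(row["category"], row["difficulty"])].append(row)
    return dict(grouped)


# Hypothetical rows; the ids and values are not from this run.
results = [
    {"id": "example-a", "category": "coding", "difficulty": "hard", "cost_usd": 0.01},
    {"id": "example-b", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.02}},
    {"id": "example-c", "category": "ui", "difficulty": "easy"},
]

grouped = group_rows(results)
costs = [task_cost(row) for row in results]
print(costs)
```

Returning `None` rather than 0 for unrecorded costs keeps missing data distinguishable from free tasks when averaging per category.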
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)