Benchmark run
Started May 10, 2026, 1:29 PM · Recorded May 10, 2026, 2:48 PM · Ended May 10, 2026, 2:48 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 2:48 PM · qwen/qwen3-235b-a22b-2507
deepseek/deepseek-v4-flash · 459 · fail_fast=false) · 226/459 (49.2%) · 2234.12 out of 3137 (71.2% of max) · $0.0577 · 9.98s · 5.88s · 1.25s · 52.44 tok/s average
55/60 passed (91.7%), 96.1% score: fastest TTFT (0.71s) and high output speed (60.96 tok/s)
29/30 passed (96.7%), 87.3% score: excellent accuracy and low cost
48/60 passed (80%), 92.3% score: strong in easy and hard levels, weakest in medium (65% pass)
6/9 passed (66.7%), 79.1% score: limited data, but moderate performance
9/60 passed (15%), 61.0% score: very low pass rate despite decent score percentage
11/60 passed (18.3%), 58.2% score: poor pass rate across all levels
14/60 passed (23.3%), 69.1% score: struggles with constraint and runtime correctness logic
25/60 passed (41.7%), 71.9% score: inconsistent, with some complex failures (e.g., prototype pollution parsing error)
29/60 (48.3%), 72.5% score
fetch-timeout, node-crypto) but fails on common misconceptions (e.g., array-flat, promise-resolve)
halluc-edge-string-truncate took 90.73s with very low output speed (1.16 tok/s)
debugging tests (prototype-pollution-check-v2, prototype-pollution-merge-v2) failed with error: Spread syntax requires ...iterable[Symbol.iterator] to be a function. This is a runtime TypeError raised when a non-iterable value is spread, so the model likely generated code that spreads a non-iterable value (not a syntax error).
debugging::deep-clone-v2: 68.95s latency (TTFT: 38.64s)
halluc-edge-string-truncate: 90.73s latency (TTFT: 0.92s, but output speed only 1.16 tok/s)
cost had lowest total cost ($0.0011) despite high pass rate
refactoring ($0.0136) and reasoning ($0.0085) due to longer outputs and retries
The deepseek/deepseek-v4-flash model performs well in coding, cost, and speed tasks, showing fast response times and high accuracy. It struggles significantly in refactoring, security, and reasoning, indicating weaknesses in deeper program analysis and logical deduction. Hallucination resistance is moderate, with frequent false claims about API behavior.
Latency is generally acceptable, though a few pathological cases (e.g., string truncation, deep clone) show severe performance degradation.
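The headline figures can be cross-checked against the per-category results. The sketch below is a minimal consistency check, not part of blxbench: it sums the nine passed/total pairs quoted in this report and recomputes the overall pass rate and score percentage. The variable names are illustrative.

```python
# Per-category (passed, total) pairs, in the order they appear in this report.
category_counts = [
    (55, 60), (29, 30), (48, 60), (6, 9), (9, 60),
    (11, 60), (14, 60), (25, 60), (29, 60),
]

passed = sum(p for p, _ in category_counts)
total = sum(t for _, t in category_counts)

# Strict pass rate over all tasks, as a percentage with one decimal.
pass_rate = round(100 * passed / total, 1)

# Overall score as a share of the maximum achievable points.
score_pct = round(100 * 2234.12 / 3137, 1)

print(passed, total, pass_rate, score_pct)
```

The category counts add up to the reported 226/459, and both percentages match the headline values when rounded to one decimal place.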
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: Not recorded (older report.json)
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
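To relate TTFT, total latency, and decode speed for the two slow tests called out in this run, the sketch below assumes that the decode phase is the total latency minus TTFT and that the reported tok/s is measured over that decode phase. The report does not confirm either definition, so the token estimate is only indicative. The function names are hypothetical.

```python
def decode_seconds(latency_s, ttft_s):
    """Time spent generating tokens after the first one arrived (assumed definition)."""
    return latency_s - ttft_s


def estimated_output_tokens(latency_s, ttft_s, tok_per_s):
    """Rough output length, assuming tok/s is measured over the decode phase only."""
    return decode_seconds(latency_s, ttft_s) * tok_per_s


# debugging::deep-clone-v2: 68.95s latency, 38.64s TTFT
deep_clone_decode = decode_seconds(68.95, 38.64)

# halluc-edge-string-truncate: 90.73s latency, 0.92s TTFT, 1.16 tok/s
truncate_tokens = estimated_output_tokens(90.73, 0.92, 1.16)

print(round(deep_clone_decode, 2), round(truncate_tokens))
```

Under these assumptions the two outliers differ in kind: deep-clone-v2 spent more than half of its latency waiting for the first token, while halluc-edge-string-truncate started quickly and then decoded very slowly.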
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
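The grouping and cost fallback described above can be sketched roughly as follows. The `results`, `cost_usd`, and `usage.cost` names come from the table description; the `id`, `category`, and `difficulty` field names and the sample rows are assumptions made for illustration and are not taken from the actual report.json.

```python
from collections import defaultdict


def task_cost(row):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(results):
    """Group rows by category, then difficulty, keeping execution order within each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Illustrative rows only (hypothetical field names and values).
report = {"results": [
    {"id": "a", "category": "debugging", "difficulty": "hard", "cost_usd": 0.0004},
    {"id": "b", "category": "debugging", "difficulty": "easy", "usage": {"cost": 0.0001}},
    {"id": "c", "category": "debugging", "difficulty": "hard"},
]}

grouped = group_rows(report["results"])
hard_ids = [row["id"] for row in grouped["debugging"]["hard"]]
costs = [task_cost(row) for row in report["results"]]
print(hard_ids, costs)
```

Appending rows in iteration order is what keeps each group in benchmark execution order; a row with neither cost field yields `None` rather than a guessed value.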