Benchmark run
Started May 19, 2026, 6:29 PM · Recorded May 19, 2026, 7:24 PM · Ended May 19, 2026, 7:24 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 19, 2026, 7:24 PM · qwen/qwen3-235b-a22b-2507
Model: google/gemini-3.5-flash
Total tests: 459
Categories: coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Scope: no --limit or --category filter applied
fail_fast=false
results_truncated=true, meaning not all test outcomes are included in this payload.

Pass rate: 108/459 (23.5%)
Score: 1308.69 out of 3155 (41.5%)
Cost: $5.43
Average latency: 6.66s
Median latency: 4.94s
TTFT: 2.25s
Output speed: 338.25 tok/s

Per-category results
51/60 passed (85%); score 225/261 (86.2%); 2.28s TTFT, 323.1 tok/s output speed
  coding::easy: perfect pass rate (20/20)
  coding::medium: 85% pass rate
  coding::hard: 70% pass rate
13/60 passed (21.7%); score 236/512 (46.1%)
  security::medium: 40% pass rate
8/60 passed (13.3%); score 273/541 (50.5%): high score despite low pass rate, suggesting partial credit on complex tasks
14/60 passed (23.3%); score 83/180 (46.1%): better on harder tasks (35% pass rate on speed::hard)
0/60 passed (0%); score 155/601 (25.8%): all failures, but some scoring suggests partial credit
5/60 passed (8.3%); score 115/360 (31.9%): frequent false claims about APIs and behaviors
2/60 passed (3.3%); score 135/541 (24.9%): very low success rate across all difficulty levels
13/30 passed (43.3%): mixed results, with only 20% pass rate on cost::hard
9 tests; pass rate 2/9 (22.2%), all failures on easy and hard subcategories.

Observations
… (halluc-api-*, halluc-doc-*), especially in hard and medium levels.
… (121k completion tokens), pass rate is near zero in refactoring.
6.66s latency with 4.94s median: long-tail delays likely due to complex or failing tasks.
… (~2.2–2.4s), suggesting prompt processing is predictable.

Summary
The google/gemini-3.5-flash model demonstrates strong coding ability but significant weaknesses in debugging, refactoring, and hallucination avoidance. It scores moderately on reasoning and security but fails most advanced system-design tasks. The high cost ($5.43) and partial results suggest this was a large, expensive run with incomplete outcome data. Model is suitable for straightforward code generation but unreliable for system reasoning or safety-critical tasks.
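The headline pass rate can be cross-checked against the per-category pass counts quoted in the summary. The following is a minimal Python sketch, not part of the benchmark tooling: the (passed, total) pairs are transcribed from this report in the order they appear, and the names `category_counts` and `pass_rate` are illustrative assumptions.

```python
# (passed, total) per category, transcribed from the summary in listed order.
# Variable and function names are hypothetical, chosen for this sketch only.
category_counts = [
    (51, 60), (13, 60), (8, 60), (14, 60), (0, 60),
    (5, 60), (2, 60), (13, 30), (2, 9),
]


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1) if total else 0.0


total_passed = sum(passed for passed, _ in category_counts)
total_tests = sum(total for _, total in category_counts)
overall = pass_rate(total_passed, total_tests)

print(f"{total_passed}/{total_tests} ({overall}%)")
```

The per-category counts add up to the reported 108 passes over 459 tests, so the overall figure appears to be a plain ratio of totals rather than an average of per-category percentages.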
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.4
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
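The grouping and cost fallback described for the per-test table can be sketched roughly as follows. This is an illustration under assumptions, not the dashboard's actual code: the `cost_usd` and `usage.cost` fields are named in the text above, while the `category`, `difficulty`, and `id` keys, the helper names, and the sample rows are hypothetical.

```python
from collections import defaultdict


def row_cost(row: dict):
    """Prefer cost_usd, fall back to usage.cost, return None if neither is recorded."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(results: list) -> dict:
    """Group rows by (category, difficulty), keeping input order within each group."""
    groups = defaultdict(list)
    for row in results:
        # "category" and "difficulty" are assumed key names for this sketch.
        groups[(row.get("category"), row.get("difficulty"))].append(row)
    return dict(groups)


# Hypothetical rows for illustration only; not taken from this run.
sample = [
    {"id": "t1", "category": "coding", "difficulty": "easy", "cost_usd": 0.01},
    {"id": "t2", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.03}},
    {"id": "t3", "category": "coding", "difficulty": "easy"},
]

grouped = group_rows(sample)
for (category, difficulty), rows in grouped.items():
    print(category, difficulty, [(r["id"], row_cost(r)) for r in rows])
```

Appending rows in iteration order means each group keeps the order of the input list, which is one simple way to preserve benchmark execution order within a table.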