Benchmark run
Started May 21, 2026, 7:22 PM · Recorded May 21, 2026, 7:59 PM · Ended May 21, 2026, 7:59 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 21, 2026, 8:00 PM · qwen/qwen3-235b-a22b-2507
qwen/qwen3.7-max4599 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)--limit or --fail-fast)results_truncated: true — full per-test details are not included in this payload.252/459 (54.9%)2374.74 out of 3155 (75.27%)4.58s2.47s1.27s223.27 tok/s$1.8831,989 prompt + 242,791 completion = 274,780 totalCoding:
54/60 (90%)250/261 (95.79%)$0.127easy (100% pass rate).Cost:
29/30 (96.67%)134/150 (89.33%)$0.0285hard and medium tests passed.Speed:
52/60 (86.67%)164/180 (91.11%)$0.226226.55 tok/s avg), consistent across levels.Debugging:
37/60 (61.67%)482/601 (80.20%)$0.125hard pass rate = 45%.Hallucination:
33/60 (55%)268/360 (74.44%)$0.0885halluc-api-array-flat, halluc-edge-integer-overflow failed).Reasoning:
11/60 (18.33%)378/541 (69.87%)$0.765 (highest of any category)reason-constraint-batch-window: 19.97s) and high token usage (e.g., 2349 tokens in one test).Refactoring:
6/60 (10%)361/541 (66.73%)$0.204Security:
21/60 (35%)330/512 (64.45%)$0.0504easy tests (20% pass rate), better on hard (35%).UI:
9/9 (100%)7.74/9 (85.96%)$0.2679 tests.High-cost categories:
reasoning ($0.765) and ui ($0.267) are disproportionately expensive.reasoning consumed 101,943 completion tokens — 42% of all completion tokens used.Latency outliers:
reasoning tests exceeded 10s latency, with reason-ce-even-number at 14.26s.debugging and hallucination generally fast (1–3s).TTFT efficiency:
cost (1.14s) and coding (1.06s).reasoning (2.11s), contributing to high overall latency.Output speed:
cost (467.79 tok/s) — likely due to short, direct responses.reasoning (133.38 tok/s), consistent with complex, deliberative outputs.Failures in reasoning:
reason-constraint-subscription-migration: 9/9), most reasoning tests failed — suggests inconsistent logic or hallucinated constraints.Hallucination issues:
halluc-api-array-flat, halluc-api-generator-return).halluc-edge-integer-overflow and halluc-edge-string-truncate.qwen/qwen3.7-max performs strongly in coding, cost optimization, and speed, with high accuracy and low latency. It struggles significantly with reasoning and refactoring, where logical consistency and transformation correctness are weak. Hallucination remains an issue in API and edge-case knowledge. The model is cost-effective in most categories except reasoning, where long outputs drive up expense. Improvements needed in complex reasoning, security logic, and factual accuracy for JS/TS APIs.
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.4
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matchesreport.json results (benchmark execution order)