Benchmark run
Started Jun 4, 2026, 3:59 PM · Recorded Jun 4, 2026, 5:15 PM · Ended Jun 4, 2026, 5:15 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated Jun 4, 2026, 5:15 PM · qwen/qwen3-235b-a22b-2507
This BLXBench run evaluated one model, qwen/qwen3.7-plus, across 459 tests in nine categories: coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, and ui. The run was not limited to a subset of tests ("limit": null) and was not fail-fast, so every test was attempted even after failures. The results are truncated, so this summary may not include every test outcome.
The model achieved an overall pass rate of 216/459 (47.06%) and a score of 2205.0982 out of a maximum possible 3155 (69.89%). The median latency was 4.81s, while the average latency was roughly twice that at 9.44s, which points to a long tail of slow responses. Average output speed was 55.07 tok/s, and the total cost of the run was $0.352743.
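The percentages above follow directly from the raw counts, and a mean well above the median is the usual sign of a few slow outliers. The sketch below reproduces that arithmetic. The helper name `percent` and the latency sample are hypothetical and chosen only for illustration; the sample is not taken from this run.

```python
from statistics import mean, median


def percent(part: float, whole: float) -> float:
    """Return part / whole as a percentage, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Figures reported in the summary for this run.
pass_rate = percent(216, 459)
score_pct = percent(2205.0982, 3155)

# Hypothetical latencies in seconds (NOT from the run): one slow outlier
# pulls the mean far above the median.
sample = [3.9, 4.2, 4.6, 4.8, 5.0, 5.3, 38.0]
sample_median = median(sample)
sample_mean = mean(sample)

print(f"pass rate: {pass_rate}%  score: {score_pct}%")
print(f"median: {sample_median}s  mean: {sample_mean:.2f}s")
```

A single 38-second response among otherwise similar timings is enough to roughly double the mean while leaving the median unchanged, which is why the two latency figures are worth reading together.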
Coding (90.0% pass rate): Among the strongest categories, particularly excelling in coding::easy (100% pass rate) and coding::hard (90% pass rate). The model reliably generated correct code across difficulty levels.
Speed (91.67% pass rate): Performed exceptionally well, achieving a 100% pass rate on speed::hard tests, indicating robustness under performance constraints.
Cost (96.67% pass rate): Near-perfect performance, with a 100% pass rate on cost::easy and cost::medium tests, showing accurate cost-aware reasoning.
Debugging (35.0% pass rate): Performance declined significantly, especially on easy and hard levels (35% and 20% pass rates respectively). The model struggled with identifying subtle bugs.
Reasoning (18.33% pass rate): Very low pass rate, particularly poor on reasoning::medium (5% pass rate). The model failed to correctly apply complex logical or constraint-based reasoning.
Refactoring (3.33% pass rate): Extremely poor performance across all levels, with only 2 tests passed. The model failed to produce effective refactored code.
Security (13.33% pass rate): Very low pass rate, with only 2 passes in easy and medium levels. The model showed minimal ability to identify security vulnerabilities.
Hallucination (46.67% pass rate): Mixed results. The model correctly identified non-existent APIs (halluc-api-fetch-timeout, halluc-api-intl-segmenter) but failed many "claims" and "edge case" tests, indicating a tendency to invent incorrect behavior.
The reasoning category incurred the highest total cost ($0.132447), likely due to the long responses required for complex justifications.
The gap between median latency (4.81s) and average latency (9.44s) suggests some tests (e.g., long reasoning chains) took considerably longer than most.
The results_truncated flag means this summary may not reflect the complete run, potentially missing additional failures or edge cases.

Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.4
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
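The grouping and cost rules described for the per-test table can be sketched as follows. This is only an illustration under stated assumptions: the field names `category`, `difficulty`, and `id`, the helper names, and the sample rows are hypothetical, and the real report.json schema may differ. Only the `cost_usd` / `usage.cost` fallback is taken from the table description.

```python
from collections import defaultdict
from typing import Optional


def task_cost(row: dict) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(results: list) -> dict:
    """Group rows by category, then difficulty, keeping execution order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        # "category" and "difficulty" are assumed field names.
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical rows, not taken from the run.
results = [
    {"id": "t1", "category": "coding", "difficulty": "easy", "cost_usd": 0.001},
    {"id": "t2", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.002}},
    {"id": "t3", "category": "coding", "difficulty": "easy"},
]

grouped = group_rows(results)
easy_ids = [row["id"] for row in grouped["coding"]["easy"]]
costs = [task_cost(row) for row in results]
print(easy_ids, costs)
```

Returning None when no cost was recorded, rather than 0, keeps unrecorded rows out of any average cost computed later.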