Benchmark run
Started May 10, 2026, 6:05 PM · Recorded May 10, 2026, 6:30 PM · Ended May 10, 2026, 6:30 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 6:30 PM · qwen/qwen3-235b-a22b-2507
The benchmark run evaluated a single model: ibm-granite/granite-4.1-8b. A total of 459 tests were executed across multiple categories, with no test level or category filtering applied. The run was not limited in scope ("limit": null) and did not use fail-fast mode ("fail_fast": false). The results are truncated ("results_truncated": true), meaning not all test outcomes are included in this payload.
The model achieved an overall pass rate of 199/459 (43.36%) and scored 67.07% of the maximum possible points. The average latency was 2.89s, with a median of 2.27s. Time-to-first-token (TTFT) averaged 0.284s, and output speed averaged 109.79 tok/s. The total cost for the run was $0.0143, with 30,723 prompt tokens and 129,180 completion tokens used.
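The headline figures above can be re-derived with simple arithmetic. The following is a minimal sketch, not the benchmark's own code: the function name `summarize_run` and the per-test averages it returns are illustrative assumptions, and only the input numbers come from the report.

```python
def summarize_run(passed, total, cost_usd, prompt_tokens, completion_tokens):
    """Derive rates and per-test averages from the run totals (illustrative helper)."""
    return {
        # Strict pass rate: tests passed divided by tests executed.
        "pass_rate_pct": round(100 * passed / total, 2),
        # Mean spend and token usage per executed test.
        "cost_per_test_usd": cost_usd / total,
        "prompt_tokens_per_test": round(prompt_tokens / total, 2),
        "completion_tokens_per_test": round(completion_tokens / total, 2),
    }


# Totals quoted in the run summary.
summary = summarize_run(
    passed=199,
    total=459,
    cost_usd=0.0143,
    prompt_tokens=30723,
    completion_tokens=129180,
)
print(summary["pass_rate_pct"])  # 43.36, matching the reported pass rate
```

The pass rate counts only tests that passed outright, while the score percentage includes partial credit, which is why the two figures (43.36% and 67.07%) can differ widely.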
Pass rate 91.67% (55/60) and score percentage of 97.22%. All hard tests were passed perfectly.
29/30 tests passed (96.67%), with a score percentage of 90.67%. All hard and medium tests were fully passed.
56.67% pass rate (34/60) and 73.33% score. The model struggled with edge cases and API claims but performed well on behavior and bug detection.
55% pass rate (33/60) and score percentage of 64.75%. Performance declined with difficulty: 80% on easy, 50% on medium, and 35% on hard.
23.33% pass rate (14/60) despite a higher score percentage of 73.38%, indicating partial credit on many tasks. Hard and medium tests were particularly challenging.
15% pass rate (9/60) and 63.40% score. Performance was consistent across difficulty levels, with no level exceeding a 20% pass rate.
6/60 passed (10% pass rate) and 44.14% score. Notably, all 20 easy tests failed despite moderate scores, suggesting outputs that earned partial credit but were incorrect.
1/9 passed (11.11%), with a very low score percentage (21.93%). The model struggled across all difficulty levels.
debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 resulted in errors: "Spread syntax requires ...iterable[Symbol.iterator] to be a function", indicating a fundamental failure in code generation.
reason-constraint-sla-breach had extremely high latency (21.7s) and very low output speed (18.95 tok/s), suggesting potential looping or inefficiency.
Timing differed by difficulty (0.267s easy vs 0.935s medium), indicating inconsistent responsiveness on complex prompts.
A 0% pass rate on easy tests is a critical concern, as the model failed basic security checks despite generating plausible outputs (non-zero scores).
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
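The per-category pass counts quoted in the summary can be cross-checked against the overall totals. This is a rough sketch under stated assumptions: the (passed, total) pairs are copied from the summary in the order they appear, category names are not available here, and the variable names are illustrative.

```python
# (passed, total) pairs as quoted in the per-category summaries.
listed = [(55, 60), (29, 30), (34, 60), (33, 60), (14, 60), (9, 60), (6, 60), (1, 9)]

# Recompute each pass rate to confirm the quoted percentages.
rates = [round(100 * passed / total, 2) for passed, total in listed]

listed_passed = sum(passed for passed, _ in listed)
listed_total = sum(total for _, total in listed)

# Overall totals from the run summary.
overall_passed, overall_total = 199, 459

# Passes and tests not covered by the quoted summaries.
unlisted_passed = overall_passed - listed_passed
unlisted_total = overall_total - listed_total

print(rates)
print(listed_passed, listed_total, unlisted_passed, unlisted_total)
```

The quoted pairs cover 399 tests and 181 passes, leaving 60 tests and 18 passes unaccounted for. That gap is consistent with eight summaries being quoted for a run with nine categories, though the source does not say which category is absent.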
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
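The table layout described above (rows grouped by category, then difficulty, in execution order, with cost taken from cost_usd or usage.cost) could be reproduced roughly as follows. This is a sketch only: the row keys `id`, `category`, and `difficulty`, the helper names, and the sample rows are hypothetical, and the real report.json schema may differ.

```python
from collections import defaultdict


def row_cost(row):
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(results):
    """Group rows by category, then difficulty, keeping execution order."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in results:  # results is assumed to be in execution order
        groups[row["category"]][row["difficulty"]].append(row)
    return groups


# Hypothetical rows standing in for report.json -> results.
sample = [
    {"id": "t1", "category": "debug", "difficulty": "easy", "cost_usd": 0.00003},
    {"id": "t2", "category": "reason", "difficulty": "hard", "usage": {"cost": 0.00005}},
    {"id": "t3", "category": "debug", "difficulty": "easy"},
]

grouped = group_rows(sample)
costs = [row_cost(row) for row in sample]
```

Because Python dictionaries and lists preserve insertion order, rows inside each category and difficulty bucket stay in the order the benchmark executed them.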