Benchmark run
Started Jun 17, 2026, 7:52 PM · Recorded Jun 17, 2026, 10:58 PM · Ended Jun 17, 2026, 10:58 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated Jun 17, 2026, 10:58 PM · qwen/qwen3-235b-a22b-2507
z-ai/glm-5.2
458 tests · 9 categories: coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui · easy, medium, hard (across categories)
… limit or category filter applied), but results are truncated; full per-test details not available.
… fail_fast=false)
231/458 (50.4%) · 2233.9 / 3154 (70.8%) · $1.16 · 23.04s · 19.26s · 3.20s · 27.35 tok/s

94.6% score (54/60 passed)
… easy (100% score) and hard (94.0% score).

92.2% score (52/60 passed)
… 95% on hard.

84.7% score (27/30 passed)
… medium (100% pass rate).

77.5% score (37/60 passed)
… easy (40% pass rate), improves on hard (85% pass rate).

64.1% score (17/60 passed)
… easy (20% pass), declines on hard (30% pass).

69.7% score (28/60 passed)
… 45–50% pass), with easy slightly better.

63.2% score (7/60 passed)
… 11.7%), though score is higher due to partial credit.
… hard (20% pass) than easy (5% pass).

59.9% score (7/60 passed)
… 11.7%), worst in easy (5% pass).

23.7% score (2/8 passed)
… 2 passed, 1 skipped. Fails all easy tests (0/1), passes 1/2 on hard.

… ui ($0.29), due to large output (65,720 completion tokens).
… cost ($0.026), efficient despite 30 tests.
… ui or reasoning, given high token counts (e.g., 875 tokens in debugging cost $0.00359).

… security (1.92s avg TTFT)
… coding (4.10s avg TTFT)
… refactoring (42.94 tok/s)
… hallucination (17.61 tok/s)

reasoning and refactoring have near-total failure on easy tasks.
ui::easy completely failed (0/1).
… halluc-api-array-flat, halluc-api-json-parse-date, halluc-api-promises, etc., suggesting overconfidence in non-existent or incorrect API behaviors.
… db-transaction-isolation-v2 but fails basic ones like object-reference-v2.

The z-ai/glm-5.2 model performs well in coding and speed tasks, showing strong algorithmic and performance-awareness capabilities. It is cost-efficient and handles security and debugging at a moderate level. However, it struggles significantly with reasoning, refactoring, and UI tasks, and shows pronounced hallucination tendencies on easy API and edge-case claims. Its difficulty scaling is also inconsistent: it sometimes performs better on hard tasks than on easy ones, which may indicate prompt sensitivity or overthinking on simpler problems.
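As a quick consistency check, the nine per-category pass counts reported above can be summed and compared with the overall figures (231/458, 50.4%, and 2233.9 / 3154, 70.8%). The sketch below only redoes that arithmetic; the variable names are illustrative, and the pairs are listed in the order the counts appear, without category names.

```python
# (passed, total) pairs for the nine category results, in the order reported above.
cards = [
    (54, 60), (52, 60), (27, 30), (37, 60), (17, 60),
    (28, 60), (7, 60), (7, 60), (2, 8),
]

passed = sum(p for p, _ in cards)
total = sum(t for _, t in cards)

# Overall strict pass rate, as a percentage rounded to one decimal place.
pass_rate = round(100 * passed / total, 1)

# Overall score percentage from the reported point totals.
score_pct = round(100 * 2233.9 / 3154, 1)

print(f"{passed}/{total} passed ({pass_rate}%), score {score_pct}%")
```

The gap between the two percentages is consistent with the note above that score can exceed pass rate because of partial credit.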
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.4
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
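The cost rule described above (per-task USD from cost_usd or usage.cost when recorded) could be sketched roughly as follows. This is an assumption-laden illustration, not the dashboard's actual code: the function name, the row shape, the sample rows, and the precedence of cost_usd over usage.cost are all hypothetical.

```python
from typing import Any, Optional


def task_cost_usd(row: dict[str, Any]) -> Optional[float]:
    """Return a per-task USD cost, or None when no cost was recorded.

    Assumed precedence: a top-level cost_usd wins, then a nested usage.cost.
    """
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return float(cost) if cost is not None else None


# Hypothetical rows, shaped only for illustration.
rows = [
    {"id": "hypothetical-a", "cost_usd": 0.00359},
    {"id": "hypothetical-b", "usage": {"cost": 0.002}},
    {"id": "hypothetical-c"},  # no cost recorded
]

costs = [task_cost_usd(r) for r in rows]
recorded = [c for c in costs if c is not None]
print(costs, f"{len(recorded)} of {len(rows)} rows have a recorded cost")
```

Returning None rather than 0.0 for missing costs keeps unrecorded rows distinguishable from tasks that cost nothing.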
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)