BLXBench - Run run

Benchmark run

Started Jun 17, 2026, 7:52 PM · Recorded Jun 17, 2026, 10:58 PM · Ended Jun 17, 2026, 10:58 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 74.1
Tests: 459
Models: 1
Passed: 231
Failed: 228
Pass rate: 50.3%
Duration: 11153.1 s
Categories: 9
Speed avg: 30.2 t/s
Speed TTFT: 3995 ms
Cost/strict: $0.0010
Strict success: 90.0%
Score/$: 97017.12
Failed spend: $0.0059
P50 task cost: $0.0005
P90 task cost: $0.0021
Est. cost (run): $1.16
Tokens (Σ results): 29.4k / 253.3k
Submitted by: Bitslix

Run summary

Generated Jun 17, 2026, 10:58 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: z-ai/glm-5.2
Total tests: 458
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Test levels: easy, medium, hard (across categories)
Run mode: Full run (no limit or category filter applied), but results are truncated — full per-test details not available.
Fail-fast: Disabled (fail_fast=false)

Performance Summary

Overall pass rate: 231/458 (50.4%)
Overall score: 2233.9 / 3154 (70.8%)
Total cost: $1.16
Average latency: 23.04s
Median latency: 19.26s
Average TTFT (Time to First Token): 3.20s
Average output speed: 27.35 tok/s
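The headline percentages can be rechecked from the raw counts. The sketch below is only a sanity check on the figures quoted above, not the benchmark's own aggregation code; the helper name `pct` is hypothetical. It also shows that dividing the same 231 passes by the 459 tasks listed in the page header gives the 50.3% shown there, which appears consistent with one task being counted in the header but not in the summary.

```python
# Figures copied from the Performance Summary and the page header.
passed = 231
total_summary = 458      # "Total tests" in the run summary
total_header = 459       # "Tests" in the page header
score, max_score = 2233.9, 3154
total_cost_usd = 1.16


def pct(numerator: float, denominator: float) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


pass_rate_summary = pct(passed, total_summary)
pass_rate_header = pct(passed, total_header)
score_pct = pct(score, max_score)
mean_cost_per_test = total_cost_usd / total_summary

print(f"pass rate (458 tests): {pass_rate_summary}%")
print(f"pass rate (459 tests): {pass_rate_header}%")
print(f"score: {score_pct}%")
print(f"mean cost per test: ${mean_cost_per_test:.4f}")
```

The mean cost per test is well above the reported P50 task cost of $0.0005, which suggests a small number of expensive tasks dominate the total.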

Category-Level Performance

✅ Strong Performers (>84% score)

Coding: 94.6% score (54/60 passed)
- Excels in easy (100% score) and hard (94.0% score).
Speed: 92.2% score (52/60 passed)
- Strong across all levels, peaking at 95% on hard.
Cost: 84.7% score (27/30 passed)
- Solid performance, especially on medium (100% pass rate).

⚠️ Moderate Performers (64–84% score)

Hallucination: 77.5% score (37/60 passed)
- Struggles on easy (40% pass rate), improves on hard (85% pass rate).
Security: 64.1% score (17/60 passed)
- Pass rate is low at every level: 20% on easy, 30% on hard.
Debugging: 69.7% score (28/60 passed)
- Consistent across levels (~45–50% pass), with easy slightly better.

❌ Weak Performers (<64% score)

Reasoning: 63.2% score (7/60 passed)
- Very low pass rate (11.7%), though score is higher due to partial credit.
- Slightly better on hard (20% pass) than easy (5% pass).
Refactoring: 59.9% score (7/60 passed)
- Extremely low pass rate (11.7%), worst in easy (5% pass).
UI: 23.7% score (2/8 passed)
- Only 2 passed, 1 skipped. Fails all easy tests (0/1), passes 1/2 on hard.
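The per-category pass counts above can be cross-checked against the run totals, and the gap between pass rate and score made explicit. This is a minimal sketch using only the passed/total pairs quoted in this section; the dictionary layout is an assumption for illustration, not the structure of report.json.

```python
# (passed, total) per category, copied from the breakdown above.
categories = {
    "coding": (54, 60),
    "speed": (52, 60),
    "cost": (27, 30),
    "hallucination": (37, 60),
    "security": (17, 60),
    "debugging": (28, 60),
    "reasoning": (7, 60),
    "refactoring": (7, 60),
    "ui": (2, 8),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# Strict pass rate per category, as a percentage with one decimal.
pass_rates = {
    name: round(100 * p / t, 1) for name, (p, t) in categories.items()
}

print(f"total: {total_passed}/{total_tests}")
for name, rate in sorted(pass_rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{rate:5.1f}%")
```

The totals come out to 231/458, matching the Performance Summary. Security passes only 28.3% of its tests while scoring 64.1%, and reasoning passes 11.7% while scoring 63.2%, so the score percentages in these categories rest largely on partial credit rather than strict passes.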

Notable Observations

Cost & Efficiency

Highest cost category: ui ($0.29) — due to large output (65,720 completion tokens).
Lowest cost category: cost ($0.026) — efficient despite 30 tests.
Most expensive test: likely in ui or reasoning, given their high token counts. For scale, an 875-token debugging test cost $0.00359.
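The two cost figures quoted above imply a similar price per token. The sketch below divides each cost by its token count, under the loud assumption that spend is dominated by completion tokens (prompt tokens are ignored), so the result is a rough upper bound on the completion price rather than the provider's actual rate. The function name is hypothetical.

```python
def usd_per_million_tokens(cost_usd: float, tokens: int) -> float:
    """Implied price per million tokens, attributing all cost to these tokens."""
    return cost_usd / tokens * 1_000_000


# ui category: $0.29 for 65,720 completion tokens (from the summary).
ui_rate = usd_per_million_tokens(0.29, 65_720)

# Single debugging test: $0.00359 for 875 tokens (from the summary).
debugging_rate = usd_per_million_tokens(0.00359, 875)

print(f"ui:        ${ui_rate:.2f} per 1M tokens")
print(f"debugging: ${debugging_rate:.2f} per 1M tokens")
```

Both figures land near $4 per million tokens, which supports the summary's reading that the ui category is expensive because of output volume, not because of a different rate.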

Latency & Speed

Fastest category (TTFT): security (1.92s avg TTFT)
Slowest category (TTFT): coding (4.10s avg TTFT)
Fastest output generation: refactoring (42.94 tok/s)
Slowest output generation: hallucination (17.61 tok/s)
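The run-level latency figures can be related through a simple streaming model: total latency is roughly time to first token plus output tokens divided by decode speed. This is an assumption for illustration, not how the benchmark computes anything, and averages of per-test ratios do not compose exactly, so the result is an order-of-magnitude check only.

```python
# Run-level averages copied from the Performance Summary.
avg_latency_s = 23.04
avg_ttft_s = 3.20
avg_speed_tok_s = 27.35

# Assumed model: latency ≈ TTFT + output_tokens / decode_speed.
decode_time_s = avg_latency_s - avg_ttft_s
implied_output_tokens = decode_time_s * avg_speed_tok_s

print(f"decode time: {decode_time_s:.2f} s")
print(f"implied output per test: {implied_output_tokens:.0f} tokens")
```

Under this model a typical test spends about 14% of its latency waiting for the first token and the rest decoding roughly 540 output tokens.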

Failures & Errors

Critical failure areas:
- reasoning and refactoring have near-total failure on easy tasks.
- ui::easy completely failed (0/1).
High hallucination on easy API claims: Fails halluc-api-array-flat, halluc-api-json-parse-date, halluc-api-promises, etc., suggesting overconfidence in non-existent or incorrect API behaviors.
Debugging: Mixed results — passes complex issues like db-transaction-isolation-v2 but fails basic ones like object-reference-v2.

Conclusion

The z-ai/glm-5.2 model performs well in coding and speed tasks, showing strong algorithmic and performance-awareness capabilities. It is cost-efficient and handles security and debugging at a moderate level. However, it struggles significantly with reasoning, refactoring, and UI tasks, and shows pronounced hallucination tendencies on easy API and edge-case claims. The model also exhibits inconsistent difficulty scaling, sometimes performing better on hard than easy tasks, which may indicate prompt sensitivity or overthinking on simpler problems.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
z-ai/glm-5.2 | 458 | 231/458 | 70.8% | 23.04s | $1.16

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)