BLXBench - Run

Benchmark run

Started May 10, 2026, 8:42 PM · Recorded May 10, 2026, 9:12 PM · Ended May 10, 2026, 9:12 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score 75.1 · Tests 459 · Models 1

Passed 235

Failed 224

Pass rate 51.2%

Duration 1804.3s

Categories 9

Models 1

Speed avg 172.8 t/s

Speed TTFT 288ms

Cost/strict $0.0007

Strict success 93.3%

Score/$ 132948.80

Failed spend $0.0020

P50 task cost $0.0006

P90 task cost $0.0010

Est. cost (run) $1.03

Tokens (Σ results) 28.5k / 227.2k

Submitted by Bitslix

Run summary

Generated May 10, 2026, 9:12 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: z-ai/glm-5.1
Total tests: 459
Categories covered: All available categories were tested (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit), not fail-fast (fail_fast=false)
Results: Truncated output — full results not shown

Performance Summary

Overall pass rate: 235/459 (51.2%)
Score: 2261.71 / 3137 (72.1%)
Total cost: $1.029
Latency:
- Avg TTFT (Time to First Token): 0.315s
- Avg latency: 3.375s
- Median latency: 2.004s
Output speed: 163.73 tok/s
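The two headline percentages above come from different quantities: the pass rate counts tests, while the score is a ratio of points. A minimal sketch of that arithmetic, using only the figures quoted in this summary (the variable names are illustrative, not BLXBench fields):

```python
# Figures copied from the Performance Summary above.
passed, total_tests = 235, 459
score, max_score = 2261.71, 3137

# Pass rate: share of tests that passed outright.
pass_rate = 100 * passed / total_tests

# Score: points earned as a share of points available.
score_pct = 100 * score / max_score

print(f"pass rate: {pass_rate:.1f}%")
print(f"score:     {score_pct:.1f}%")
```

Both values round to the percentages reported above (51.2% and 72.1%). That the score sits well above the pass rate suggests failed tests still earn partial points, although the summary does not say how points are awarded.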

Category-Level Performance

High-Performing Categories

Speed: 53/60 passed (88.3%), 93.9% score — fastest output (172.81 tok/s)
Cost: 28/30 passed (93.3%), 88.7% score — high accuracy in cost-aware tasks
Coding: 49/60 passed (81.7%), 92.0% score — strong on algorithmic tasks
- Best in easy (100% score) and hard (89.2%)
UI: Only 3/9 passed (33.3%), so this is not a high performer; it is listed here because its $0.267 cost is about 26% of total spend

Low-Performing Categories

Reasoning: Only 11/60 passed (18.3%), 60.4% score — weakest category
Refactoring: 11/60 passed (18.3%), 64.3% score — struggles with code transformation
Hallucination: 28/60 passed (46.7%), 73.3% score — frequent factual errors
Debugging: 33/60 passed (55.0%), 75.8% score — inconsistent on concurrency and race conditions
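The per-category pass rates above can be rechecked from the raw counts, and the same counts show what the summary leaves out. The sketch below assumes the nine categories partition the 459 tests, so anything not listed belongs to security, the one category with no figures of its own. That remainder is a derived value, not a reported one.

```python
# (passed, total) per category, copied from the two lists above.
reported = {
    "speed": (53, 60),
    "cost": (28, 30),
    "coding": (49, 60),
    "ui": (3, 9),
    "reasoning": (11, 60),
    "refactoring": (11, 60),
    "hallucination": (28, 60),
    "debugging": (33, 60),
}
RUN_PASSED, RUN_TOTAL = 235, 459  # overall figures from the summary

# Pass rate per listed category, in percent.
pass_rates = {name: round(100 * p / t, 1) for name, (p, t) in reported.items()}

# Assumption: the unlisted remainder is the security category.
rest_passed = RUN_PASSED - sum(p for p, _ in reported.values())
rest_total = RUN_TOTAL - sum(t for _, t in reported.values())

for name, rate in pass_rates.items():
    print(f"{name:14s}{rate:5.1f}%")
print(f"unlisted remainder: {rest_passed}/{rest_total} passed")
```

Under that assumption the remainder is 19 of 60 tests passed, which would place security among the low-performing categories by pass rate.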

Notable Observations

High-cost tasks: ui category cost $0.267 (26% of total) despite only 9 tests — due to high token output (e.g., ui::hard tests generated up to 14.9k tokens)
Latency outliers:
- reason-constraint-consistency-latency took 6.69s
- Several debugging tests exceeded 3.5s (e.g., debug-deep-clone-v2: 3.88s)
Failures under complexity:
- reasoning and refactoring show sharp drop in medium/hard levels
- debugging::hard pass rate: 55%, with failures in microtask-race, distributed-lock-expiry
Hallucination patterns:
- Fails on API behavior claims (e.g., halluc-api-array-flat, halluc-api-generator-return)
- Misunderstands edge cases: regex-backtrack, integer-overflow, nan-equality
Two tests failed with the same runtime error:
- debug-prototype-pollution-check-v2
- debug-prototype-pollution-merge-v2
Error: Spread syntax requires ...iterable[Symbol.iterator] to be a function
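The cost observation above is easier to judge per test than in total. A rough sketch using only the figures quoted in this summary (the names are illustrative):

```python
# Figures copied from the summary above.
ui_cost, ui_tests = 0.267, 9
run_cost, run_tests = 1.029, 459

# Share of total spend that went to the ui category, in percent.
ui_share = 100 * ui_cost / run_cost

# Mean cost per test: ui versus the whole run.
ui_per_test = ui_cost / ui_tests
run_per_test = run_cost / run_tests
ratio = ui_per_test / run_per_test

print(f"ui share of spend: {ui_share:.0f}%")
print(f"ui mean cost/test:  ${ui_per_test:.4f}")
print(f"run mean cost/test: ${run_per_test:.4f}")
print(f"ratio: {ratio:.1f}x")
```

On these numbers a ui test costs roughly 13 times the run-wide mean, which is consistent with the large token output noted above.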

Conclusion

The z-ai/glm-5.1 model performs well in coding, cost, and speed tasks, but struggles significantly with reasoning, refactoring, and hallucination avoidance. It shows strong output speed and low TTFT, but incurs high cost in ui tasks due to verbose generations. Critical weaknesses in debugging concurrency issues and reasoning under constraints suggest limitations in deep program understanding.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model          Tests  Pass     Score  Latency  Cost
z-ai/glm-5.1   459    235/459  72.1%  3.38s    $1.03

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

[Chart] Category performance (this run): Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

[Chart] By difficulty level

[Chart] Pass vs fail

[Chart] Score & success by cluster: Avg score % (bars) and strict success rate % (line) per cost cluster.

[Chart] Latency distribution: Per-test latency (seconds), successful timings only.

[Chart] Streaming shape (radar): Normalized TTFT (inverted) vs decode tok/s per category for this run.

[Chart] Metrics breakdown (avg %): Average score % per metric dimension across all v2 tasks in this run.

[Chart] Tests & cost by category: Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
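The cost rule described above (per-task USD from cost_usd, or from usage.cost when that is what was recorded) can be sketched as a small helper. Only the cost_usd and usage.cost names come from this page. The "category" key, the row shape, and both function names are assumptions made for illustration, and the sample rows are invented rather than taken from this run.

```python
from collections import defaultdict


def task_cost(row):
    """Return per-task USD: cost_usd if recorded, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    usage = row.get("usage") or {}
    return usage.get("cost")


def cost_by_category(results):
    """Sum recorded costs per category, skipping rows with no recorded cost."""
    totals = defaultdict(float)
    for row in results:
        cost = task_cost(row)
        if cost is not None:
            totals[row["category"]] += cost  # "category" key is an assumption
    return dict(totals)


# Invented rows for illustration only; not data from this run.
sample = [
    {"category": "ui", "cost_usd": 0.03},
    {"category": "ui", "usage": {"cost": 0.02}},
    {"category": "speed"},  # no cost recorded, so it is skipped
]
totals = cost_by_category(sample)
print(totals)
```

Rows with no recorded cost are skipped rather than counted as zero, so a category total reflects only tasks whose cost was actually written.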