BLXBench - Run run

Benchmark run

Started May 21, 2026, 7:22 PM · Recorded May 21, 2026, 7:59 PM · Ended May 21, 2026, 7:59 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 78.5 · Tests: 459 · Models: 1

Passed: 252
Failed: 207
Pass rate: 54.9%
Duration: 2251.9 s
Categories: 9
Models: 1
Speed avg: 226.5 t/s
Speed TTFT: 1218 ms
Cost/strict: $0.0010
Strict success: 96.7%
Score/$: 93987.35
Failed spend: $0.0004
P50 task cost: $0.0007
P90 task cost: $0.0017
Est. cost (run): $1.88
Tokens (Σ results): 32.0k prompt / 242.8k completion
Submitted by: Bitslix

Run summary

Generated May 21, 2026, 8:00 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: qwen/qwen3.7-max
Total tests: 459
Categories evaluated: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --fail-fast)
Results completeness: results_truncated: true — full per-test details are not included in this payload.

Performance Summary

Overall pass rate: 252/459 (54.9%)
Total score: 2374.74 out of 3155 (75.27%)
Average latency: 4.58s
Median latency: 2.47s
Average TTFT (Time to First Token): 1.27s
Average output speed: 223.27 tok/s
Total cost: $1.88
Tokens used: 31,989 prompt + 242,791 completion = 274,780 total
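
The headline percentages and token total above can be re-derived from the raw counts. The following is a minimal sketch that only re-does the arithmetic. The variable names are illustrative and are not taken from report.json, and the mean cost per test is a derived figure that the summary itself does not report.

```python
# Raw figures copied from the performance summary above.
passed, total_tests = 252, 459
score, max_score = 2374.74, 3155
prompt_tokens, completion_tokens = 31_989, 242_791
total_cost_usd = 1.88

pass_rate = round(100 * passed / total_tests, 1)    # 54.9
score_pct = round(100 * score / max_score, 2)       # 75.27
total_tokens = prompt_tokens + completion_tokens    # 274780

# Derived here for illustration only: mean spend per test across the whole run.
mean_cost_per_test = total_cost_usd / total_tests

print(f"pass rate {pass_rate}%, score {score_pct}%, tokens {total_tokens:,}")
print(f"mean cost per test ${mean_cost_per_test:.4f}")
```

The mean comes out near $0.0041 per test, well above the P50 ($0.0007) and P90 ($0.0017) task costs shown on the dashboard. That would be consistent with a small number of expensive tests, although the truncated payload does not allow confirming it.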

Category-Level Performance

High-Performing Categories

Coding:
- Pass rate: 54/60 (90%)
- Score: 250/261 (95.79%)
- Cost: $0.127
- Strong across all difficulty levels, especially easy (100% pass rate).
Cost:
- Pass rate: 29/30 (96.67%)
- Score: 134/150 (89.33%)
- Cost: $0.0285
- Near-perfect performance; all hard and medium tests passed.
Speed:
- Pass rate: 52/60 (86.67%)
- Score: 164/180 (91.11%)
- Cost: $0.226
- High output speed (226.55 tok/s avg), consistent across levels.

Medium-Performing Categories

Debugging:
- Pass rate: 37/60 (61.67%)
- Score: 482/601 (80.20%)
- Cost: $0.125
- Performance drops with difficulty: hard pass rate = 45%.
Hallucination:
- Pass rate: 33/60 (55%)
- Score: 268/360 (74.44%)
- Cost: $0.0885
- Struggles with API and edge-case claims (e.g., halluc-api-array-flat, halluc-edge-integer-overflow failed).

Low-Performing Categories

Reasoning:
- Pass rate: 11/60 (18.33%)
- Score: 378/541 (69.87%)
- Cost: $0.765 (highest of any category)
- Very low pass rate despite moderate score — indicates partial credit on many failures.
- Long latencies (e.g., reason-constraint-batch-window: 19.97s) and high token usage (e.g., 2349 tokens in one test).
Refactoring:
- Pass rate: 6/60 (10%)
- Score: 361/541 (66.73%)
- Cost: $0.204
- Poor pass rate across all levels; likely struggles with code transformation logic.
Security:
- Pass rate: 21/60 (35%)
- Score: 330/512 (64.45%)
- Cost: $0.0504
- Weak on easy tests (20% pass rate), better on hard (35%).
UI:
- Pass rate: 9/9 (100%)
- Score: 7.74/9 (85.96%)
- Cost: $0.267
- All 9 tests passed; the concern here is cost, at $0.267 for only 9 tests (about $0.030 per test).
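
The per-category figures above can be cross-checked against the run totals, and the gap between score percentage and pass rate can be computed to see where partial credit is concentrated. This is a sketch under the assumption that the category numbers are exactly as listed. The tuple layout and the `gaps` metric are illustrative choices, not something BLXBench defines.

```python
# (passed, tests, score, max_score) per category, copied from the lists above.
categories = {
    "coding": (54, 60, 250, 261),
    "cost": (29, 30, 134, 150),
    "speed": (52, 60, 164, 180),
    "debugging": (37, 60, 482, 601),
    "hallucination": (33, 60, 268, 360),
    "reasoning": (11, 60, 378, 541),
    "refactoring": (6, 60, 361, 541),
    "security": (21, 60, 330, 512),
    "ui": (9, 9, 7.74, 9),
}

# Totals should match the run-level summary: 252/459 passed, 2374.74/3155 score.
total_passed = sum(c[0] for c in categories.values())
total_tests = sum(c[1] for c in categories.values())
total_score = round(sum(c[2] for c in categories.values()), 2)
total_max = sum(c[3] for c in categories.values())

# Score % minus pass rate %, in percentage points. A large gap means many
# tests earned most of their points but still counted as failed.
gaps = {
    name: round(100 * s / m - 100 * p / t, 1)
    for name, (p, t, s, m) in categories.items()
}
largest_gap = max(gaps, key=gaps.get)

print(total_passed, total_tests, total_score, total_max)
print(largest_gap, gaps[largest_gap])
```

On these numbers the totals reconcile with the summary, and refactoring shows the widest gap (about 56.7 points), followed by reasoning (about 51.5 points).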

Notable Observations

High-cost categories:
- reasoning ($0.765) and ui ($0.267) are disproportionately expensive.
- reasoning consumed 101,943 completion tokens — 42% of all completion tokens used.
Latency outliers:
- Several reasoning tests exceeded 10s latency, with reason-ce-even-number at 14.26s.
- debugging and hallucination generally fast (1–3s).
TTFT efficiency:
- Fastest average TTFT in cost (1.14s) and coding (1.06s).
- Slowest in reasoning (2.11s), contributing to high overall latency.
Output speed:
- Highest in cost (467.79 tok/s) — likely due to short, direct responses.
- Lowest in reasoning (133.38 tok/s), consistent with complex, deliberative outputs.
Failures in reasoning:
- Despite high scores in some tests (e.g., reason-constraint-subscription-migration: 9/9), most reasoning tests failed — suggests inconsistent logic or hallucinated constraints.
Hallucination issues:
- Fails on factual API knowledge (e.g., halluc-api-array-flat, halluc-api-generator-return).
- Also fails on edge behaviors like halluc-edge-integer-overflow and halluc-edge-string-truncate.
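
The cost observations above can be made concrete by computing the reasoning share of completion tokens and the spend per test in each category. This is a rough sketch using only figures quoted in this summary. The dictionary layout and the per-test metric are assumptions made for illustration.

```python
# Completion-token figures quoted in the observations above.
completion_tokens_total = 242_791
reasoning_completion_tokens = 101_943
reasoning_share = round(100 * reasoning_completion_tokens / completion_tokens_total, 1)

# (cost_usd, tests) per category, copied from the category-level section.
cost_and_tests = {
    "coding": (0.127, 60),
    "cost": (0.0285, 30),
    "speed": (0.226, 60),
    "debugging": (0.125, 60),
    "hallucination": (0.0885, 60),
    "reasoning": (0.765, 60),
    "refactoring": (0.204, 60),
    "security": (0.0504, 60),
    "ui": (0.267, 9),
}

# Spend per test, ranked from most to least expensive.
cost_per_test = {name: cost / tests for name, (cost, tests) in cost_and_tests.items()}
ranked = sorted(cost_per_test, key=cost_per_test.get, reverse=True)

print(f"reasoning share of completion tokens: {reasoning_share}%")
for name in ranked:
    print(f"{name:14s} ${cost_per_test[name]:.4f} per test")
```

Ranked this way, ui is the most expensive category per test (roughly $0.030) and reasoning is second (roughly $0.013), even though reasoning has the highest total spend.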

Conclusion

qwen/qwen3.7-max performs strongly in coding, cost, and speed, with pass rates between 86.67% and 96.67%. It struggles most with reasoning (18.33% pass rate) and refactoring (10%), where logical consistency and transformation correctness are weak, and security is also weak at 35%. Hallucination remains an issue in API and edge-case knowledge. The model is cost-effective in most categories except reasoning, where long outputs drive up expense. Improvements are needed in complex reasoning, security logic, and factual accuracy for JS/TS APIs.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
qwen/qwen3.7-max | 459 | 252/459 | 75.3% | 4.58s | $1.88

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)