BLXBench - Run run

Benchmark run

Started May 10, 2026, 8:12 PM · Recorded May 10, 2026, 8:43 PM · Ended May 10, 2026, 8:43 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 73.2 · Tests: 459 · Models: 1

Passed: 210
Failed: 249
Pass rate: 45.8%
Duration: 1834.5s
Categories: 9
Models: 1
Speed avg: 217.4 t/s
Speed TTFT: 638ms
Cost/strict: $0.0003
Strict success: 96.7%
Score/$: 366556.60
Failed spend: $0.0008
P50 task cost: $0.0002
P90 task cost: $0.0006
Est. cost (run): $0.43
Tokens (Σ results): 31.9k / 285.9k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 8:43 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: qwen/qwen3.6-flash
Total tests: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Run mode: Full benchmark (neither --limit nor --fail-fast ended the run early)
Results completeness: results_truncated: true — some detailed outputs may be omitted, but aggregate metrics are complete.

Performance Summary

Overall pass rate: 210/459 (45.75%)
Score achieved: 2173.62 out of 3137 (69.29%)
Total cost: $0.432
Latency:
- Avg TTFT (Time to First Token): 0.665s
- Avg latency per test: 3.595s
- Median latency: 1.951s
Output speed: 203.97 tok/s (average)
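The two headline percentages above follow directly from the raw counts. A minimal sketch of that arithmetic, using only the figures listed in this summary (the helper name percent is illustrative and not part of blxbench):

```python
# Figures copied from the performance summary above.
passed, total_tests = 210, 459
score, max_score = 2173.62, 3137


def percent(part: float, whole: float) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


pass_rate = percent(passed, total_tests)
score_pct = percent(score, max_score)

print(f"pass rate: {pass_rate:.2f}%")
print(f"score:     {score_pct:.2f}%")
```

Rounded to two decimals this reproduces the 45.75% pass rate and 69.29% score reported above.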

Category-Level Performance

High-Performing Categories

Speed: 52/60 passed (86.67%), 92.22% score; the highest score of any category.
Cost: 29/30 passed (96.67%), 90% score; the highest pass rate, excellent on cost-aware tasks.
Coding: 49/60 passed (81.67%), 90.04% score; strong on algorithmic tasks, especially at easy level (100% score).
UI: 5/9 passed (55.56%), 62.45% score; moderate success, but only 9 tests, so the sample is small.

Low-Performing Categories

Refactoring: 6/60 passed (10%), 58.96% score — weakest category; struggles across all difficulty levels.
Reasoning: 11/60 passed (18.33%), 70.43% score — poor pass rate despite moderate scoring; fails complex constraint reasoning.
Security: 10/60 passed (16.67%), 52.34% score — low pass rate, especially on easy (2/20) and hard (2/20) tasks.
Debugging: 19/60 passed (31.67%), 68.95% score — inconsistent; fails many subtle bugs, especially in concurrency and state management.

Hallucination

Pass rate: 29/60 (48.33%), 72.78% score.
Mixed results: passes some hard API behavior questions (e.g., fetch-timeout, weakref-gc) but fails basic ones (e.g., array-flat, json-parse-date).
Struggles with edge cases and documentation accuracy (e.g., tc39-pipeline, validation-pipe).
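The per-category counts quoted above can be cross-checked against the overall totals. The sketch below is only a consistency check on the numbers in this summary; it does not read report.json, and the dictionary layout is an assumption chosen for illustration.

```python
# (passed, total) per category, copied from the category summaries above.
categories = {
    "speed": (52, 60),
    "cost": (29, 30),
    "coding": (49, 60),
    "ui": (5, 9),
    "refactoring": (6, 60),
    "reasoning": (11, 60),
    "security": (10, 60),
    "debugging": (19, 60),
    "hallucination": (29, 60),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# Rank categories from highest to lowest pass rate.
ranked = sorted(
    categories.items(),
    key=lambda item: item[1][0] / item[1][1],
    reverse=True,
)

for name, (p, t) in ranked:
    print(f"{name:<14}{p:>3}/{t:<3} {100 * p / t:6.2f}%")
print(f"{'overall':<14}{total_passed:>3}/{total_tests:<3} "
      f"{100 * total_passed / total_tests:6.2f}%")
```

The nine categories sum to 210 passes over 459 tests, matching the overall pass rate reported for the run.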

Notable Observations

Cost efficiency: Very low total cost ($0.43) for 459 tests, using 31,864 prompt and 285,906 completion tokens.
Latency profile: Fast TTFT (0.66s avg), but long tail in latency (median 1.95s, mean 3.59s) — some complex tasks take significantly longer.
Difficulty trends:
- Easy: High pass rates in coding, cost, speed.
- Hard: Sharp drop in refactoring (1/20), security (2/20), and debugging (5/20).
Failures in reasoning: Despite a 70.43% score, reasoning had only 11 passes out of 60; the model often generates plausible but incorrect logic under constraints.
Critical errors: Two debugging tests failed with "Spread syntax requires ...iterable[Symbol.iterator] to be a function" — suggests model-generated code with runtime errors.
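A few derived figures help put the cost and latency observations in context. The sketch below uses only totals quoted in this summary; the variable names are illustrative. A mean-to-median latency ratio well above 1 is consistent with a long right tail, although the summary alone does not give the full distribution.

```python
# Totals copied from this run summary.
total_cost_usd = 0.432
total_tests = 459
completion_tokens = 285_906
mean_latency_s = 3.595
median_latency_s = 1.951

# Average spend and output size per test.
cost_per_test = total_cost_usd / total_tests
completion_tokens_per_test = completion_tokens / total_tests

# Ratio > 1 means the mean is pulled above the median by slow tests.
mean_to_median = mean_latency_s / median_latency_s

print(f"mean cost per test:         ${cost_per_test:.5f}")
print(f"completion tokens per test: {completion_tokens_per_test:.1f}")
print(f"mean / median latency:      {mean_to_median:.2f}")
```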

Conclusion

The qwen/qwen3.6-flash model delivers strong performance in coding, cost optimization, and speed, with fast response times and low cost. However, it struggles significantly with refactoring, security, and complex reasoning, and shows inconsistent behavior in debugging and hallucination avoidance. It is suitable for lightweight, performance-sensitive tasks but may require validation for correctness in complex or safety-critical domains.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model               Tests  Pass     Score  Latency  Cost
qwen/qwen3.6-flash  459    210/459  69.3%  3.59s    $0.43

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
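The grouping described above (by category, then by difficulty, keeping execution order within each group) and the cost fallback (cost_usd, else usage.cost) can be sketched as follows. This is an illustration only: the row keys "id", "category" and "difficulty" and the sample rows are hypothetical, and the actual report.json schema may differ.

```python
from __future__ import annotations

from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: cost_usd if recorded, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


def group_results(results: list[dict[str, Any]]) -> dict[str, dict[str, list]]:
    """Group rows by category, then difficulty, preserving input order."""
    grouped: dict[str, dict[str, list]] = {}
    for row in results:
        by_difficulty = grouped.setdefault(row["category"], {})
        by_difficulty.setdefault(row["difficulty"], []).append(row)
    return grouped


# Hypothetical rows, not taken from this run.
sample = [
    {"id": "t1", "category": "coding", "difficulty": "easy", "cost_usd": 0.0002},
    {"id": "t2", "category": "security", "difficulty": "hard", "usage": {"cost": 0.0006}},
    {"id": "t3", "category": "coding", "difficulty": "easy"},
]

grouped = group_results(sample)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        costs = [task_cost(r) for r in rows]
        print(category, difficulty, [r["id"] for r in rows], costs)
```

Plain dictionaries are used because they preserve insertion order in Python 3.7+, which keeps rows in the order they appear in the input list.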