BLXBench - Run run

Benchmark run

Started Jun 4, 2026, 3:59 PM · Recorded Jun 4, 2026, 5:15 PM · Ended Jun 4, 2026, 5:15 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 74.7
Tests: 459
Models: 1
Passed: 216
Failed: 243
Pass rate: 47.1%
Duration: 4503.1 s
Categories: 9
Speed avg: 58.2 t/s
Speed TTFT: 1069 ms
Cost/strict: $0.0002
Strict success: 96.7%
Score/$: 394776.32
Failed spend: $0.0006
P50 task cost: $0.0002
P90 task cost: $0.0004
Est. cost (run): $0.35
Tokens (Σ results): 32.0k / 214.7k
Submitted by: Bitslix

Run summary

Generated Jun 4, 2026, 5:15 PM · qwen/qwen3-235b-a22b-2507

Scope

This BLXBench run evaluated 1 model, qwen/qwen3.7-plus, across 459 tests spanning nine categories: coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, and ui. The run was not limited to a subset of tests ("limit": null) and was not fail-fast, so every test was attempted even after failures. The results are truncated, so this summary may not cover every test outcome.

Performance

The model achieved an overall pass rate of 216/459 (47.06%) and a score of 2205.0982 out of a maximum possible 3155 (69.89%). The median latency was 4.81s, while the average latency was significantly higher at 9.44s, suggesting some long-tail inference times. The model demonstrated strong output speed at 55.07 tok/s on average, with a total cost of $0.352743.
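The two headline percentages follow directly from the counts quoted above. A quick check in Python, using only figures from this summary (the variable names are illustrative):

```python
# Figures quoted in the Performance paragraph.
passed, total = 216, 459
score, max_score = 2205.0982, 3155

pass_rate_pct = 100 * passed / total
score_pct = 100 * score / max_score

print(f"pass rate: {pass_rate_pct:.2f}%")  # 47.06%
print(f"score:     {score_pct:.2f}%")      # 69.89%
```

The score percentage is higher than the pass rate, which suggests that failed tests can still earn partial score.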

Category-Level Patterns

Coding (90.0% pass rate): The strongest category, particularly excelling in coding::easy (100% pass rate) and coding::hard (90% pass rate). The model reliably generated correct code across difficulty levels.
Speed (91.67% pass rate): Performed exceptionally well, achieving 100% pass rate on speed::hard tests, indicating robustness under performance constraints.
Cost (96.67% pass rate): Near-perfect performance, with 100% pass rate on cost::easy and cost::medium tests, showing accurate cost-aware reasoning.
Debugging (35.0% pass rate): Performance declined significantly, especially on easy and hard levels (35% and 20% pass rates respectively). The model struggled with identifying subtle bugs.
Reasoning (18.33% pass rate): Very low pass rate, particularly poor on reasoning::medium (5% pass rate). The model failed to correctly apply complex logical or constraint-based reasoning.
Refactoring (3.33% pass rate): Extremely poor performance across all levels, with only 2 tests passed. The model failed to produce effective refactored code.
Security (13.33% pass rate): Very low pass rate, with only 2 passes in easy and medium levels. The model showed minimal ability to identify security vulnerabilities.
Hallucination (46.67% pass rate): Mixed results. The model correctly identified non-existent APIs (halluc-api-fetch-timeout, halluc-api-intl-segmenter) but failed many "claims" and "edge case" tests, indicating a tendency to invent incorrect behavior.
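Per-category pass rates like those above can be derived by grouping per-test rows. The sketch below is a minimal illustration only: the rows are made up, and the field names `category` and `passed` are assumptions, not the documented report.json schema.

```python
from collections import defaultdict

# Hypothetical per-test rows; field names are assumptions for illustration.
rows = [
    {"category": "coding", "passed": True},
    {"category": "coding", "passed": True},
    {"category": "refactoring", "passed": False},
    {"category": "refactoring", "passed": False},
    {"category": "refactoring", "passed": True},
]

def pass_rates(rows):
    """Return {category: pass rate in percent}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for row in rows:
        counts[row["category"]][0] += int(row["passed"])
        counts[row["category"]][1] += 1
    return {cat: 100 * p / n for cat, (p, n) in counts.items()}

rates = pass_rates(rows)
# Print categories from strongest to weakest.
for cat, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat}: {rate:.2f}%")
```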

Notable Observations

High Cost in Reasoning: Despite low accuracy, the reasoning category incurred the highest total cost ($0.132447), likely due to long response lengths required for complex justifications.
Latency Variance: The large gap between median latency (4.81s) and average latency (9.44s) suggests some tests (e.g., long reasoning chains) took considerably longer than most.
Category Strength Disparity: The model excels at generative tasks (coding, speed, cost) but is weak at analytical tasks (debugging, reasoning, security, refactoring), highlighting a key limitation.
Truncated Results: The results_truncated flag means this summary may not reflect the complete run, potentially missing additional failures or edge cases.
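The latency observation rests on a general property: a few slow requests pull the mean well above the median. A small sketch with made-up latencies (not taken from this run) shows the effect:

```python
from statistics import mean, median

# Illustrative latencies in seconds, NOT from this run:
# most requests are quick, two are very slow.
latencies = [3.9, 4.2, 4.6, 4.8, 5.0, 5.3, 5.9, 21.0, 30.3]

med = median(latencies)
avg = mean(latencies)

print(f"median: {med:.2f}s")
print(f"mean:   {avg:.2f}s")
print(f"mean / median: {avg / med:.2f}x")
```

Here only two of nine values exceed 6 s, yet the mean is nearly twice the median, a ratio similar to the 9.44s versus 4.81s reported for this run.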

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model              Tests  Pass     Score  Latency  Cost
qwen/qwen3.7-plus  459    216/459  69.9%  9.44s    $0.35

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)