BLXBench - Run run

Benchmark run

Started May 30, 2026, 9:04 PM · Recorded May 30, 2026, 9:53 PM · Ended May 30, 2026, 9:53 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 53.4
Tests: 459
Models: 1
Passed: 175
Failed: 284
Pass rate: 38.1%
Duration: 2970.9s
Categories: 9
Speed avg: 258.2 t/s
Speed TTFT: 631ms
Cost/strict: $0.0014
Strict success: 63.3%
Score/$: 74556.34
Failed spend: $0.01
P50 task cost: $0.0009
P90 task cost: $0.0012
Est. cost (run): $0.74
Tokens (Σ results): 33.2k / 641.1k
Submitted by: Bitslix

Run summary

Generated May 30, 2026, 9:53 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: stepfun/step-3.7-flash
Total tests: 459
Categories: All categories were included (no filtering via category or level CLI options)
Run mode: Full run (not fail_fast)
Results: Truncated (only first few outcomes shown in results_compact)

Performance

Pass rate: 175/459 (38.1%)
Score: 1742.55 / 3155 (55.2%)
Latency:
- Average: 6.14s
- Median: 4.28s
- Average TTFT (Time to First Token): 0.68s
Output speed: 257.26 tok/s (average)
Total cost: $0.736
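
The percentages in this list follow directly from the raw counts. As a minimal sketch (the helper name `pct` is hypothetical and not part of blxbench), they can be recomputed like this, along with a simple average cost per test derived from the total:

```python
def pct(numerator: float, denominator: float, digits: int = 1) -> float:
    """Return numerator / denominator as a percentage rounded to `digits`."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(100.0 * numerator / denominator, digits)


pass_rate = pct(175, 459)         # passed tests / total tests
score_pct = pct(1742.55, 3155)    # points earned / points available
avg_cost_per_test = 0.736 / 459   # total cost (USD) / total tests

print(f"pass rate: {pass_rate}%")
print(f"score: {score_pct}%")
print(f"average cost per test: ${avg_cost_per_test:.4f}")
```

The average cost per test is a plain mean over all 459 tests, so it need not match the P50 or P90 task cost reported elsewhere on this page.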

Category-level Patterns

Strongest Performances

Hallucination: 68.3% pass rate, 75.8% score — best in both metrics.
Debugging: 48.3% pass rate, 62.4% score — solid reasoning on bugs.
Security: 48.3% pass rate, 73.0% score — good detection of vulnerabilities.
Cost: 63.3% pass rate, 66% score — effective at identifying inefficient code.

Weakest Performances

Refactoring: Only 3.3% pass rate (2/60), 17.7% score — severely struggles with code improvement tasks.
UI: 0% pass rate (0/9), 6.1% score — completely fails UI-related reasoning.
Coding: 23.3% pass rate (14/60), 25.7% score — poor on general programming tasks.
Reasoning: 23.3% pass rate (14/60), 67.1% score — low pass rate despite moderate scoring.
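
One way to see where partial credit matters most is to compare score % against pass rate % per category. The sketch below uses only the eight categories quoted in this summary (the run has nine), and treating the gap as a proxy for partial credit is an assumption rather than a metric the benchmark defines:

```python
# (pass rate %, score %) per category, copied from the figures quoted above.
categories = {
    "Hallucination": (68.3, 75.8),
    "Debugging": (48.3, 62.4),
    "Security": (48.3, 73.0),
    "Cost": (63.3, 66.0),
    "Refactoring": (3.3, 17.7),
    "UI": (0.0, 6.1),
    "Coding": (23.3, 25.7),
    "Reasoning": (23.3, 67.1),
}

# Gap = score % minus pass rate %; a large gap suggests many tests earned
# points without meeting the pass threshold.
gaps = {
    name: round(score - passed, 1)
    for name, (passed, score) in categories.items()
}

# Sort categories from largest gap to smallest.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for name, gap in ranked:
    print(f"{name:<14}{gap:>6.1f}")
```

On these figures Reasoning and Security show the widest gaps, while Coding and Cost show the narrowest, meaning their scores track their pass rates closely.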

Notable Latency

Security has the highest average TTFT at 1.24s, likely due to complex analysis.
Coding::hard has the highest cost per test ($0.088 across 20 tests, about $0.0044 each), reflecting longer, more token-intensive tasks.

Notable Failures & Observations

Coding: Failed all hard difficulty tests (0/20), including algorithmic challenges like astar-grid, dijkstra, and segment-tree.
Refactoring: Only passed 1 test in each of easy, medium, and hard levels — consistent poor performance.
UI: Failed all 9 tests across all difficulties — no successful outputs.
Reasoning: Despite a moderate 67.1% score, pass rate is only 23.3%, suggesting partial credit was common but full correctness rare.
Cost: The model often generates correct code in this category but fails short fixes; for example, cost-short-fix-sum-of-n and cost-short-fix-factorial-base-case failed despite their simple logic.
Debugging: Mixed results — passed complex issues like debug-db-transaction-isolation-v2 and debug-timezone-dst-bucket-v2, but failed basic ones like debug-missing-await-v2 and debug-env-flag-string-v2.

Summary

The stepfun/step-3.7-flash model shows strongest capability in detecting hallucinations and security flaws, with decent performance in debugging and cost-awareness. However, it struggles severely with coding, refactoring, and UI tasks, and fails to reliably perform simple code fixes. While it generates responses quickly (0.68s TTFT), its low pass rate in core programming categories limits practical utility. Total cost of $0.74 is moderate for this scale of testing.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                    Tests  Pass     Score  Latency  Cost
stepfun/step-3.7-flash   459    175/459  55.2%  6.14s    $0.74

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
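
The table notes say per-task cost comes from `cost_usd` or `usage.cost` when recorded. A minimal sketch of that lookup follows, assuming each row is a plain dictionary, that `usage` is a nested dictionary, and that `cost_usd` takes precedence; the function name `task_cost` and the sample rows are hypothetical and not taken from this run:

```python
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task cost in USD, or None if the row recorded neither field.

    Assumption: `cost_usd` wins over `usage.cost` when both are present.
    """
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return float(cost) if cost is not None else None


# Hypothetical rows, invented for illustration only.
rows = [
    {"cost_usd": 0.001},           # cost recorded at the top level
    {"usage": {"cost": 0.002}},    # cost recorded under usage
    {},                            # no cost recorded
]

costs = [task_cost(row) for row in rows]
total = sum(c for c in costs if c is not None)
print(costs, round(total, 4))
```

Rows without a recorded cost return `None` rather than zero, so a total computed this way covers only the tasks that reported a cost.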