BLXBench - Benchmark run

Started May 10, 2026, 8:56 PM · Recorded May 10, 2026, 10:41 PM · Ended May 10, 2026, 10:41 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 77.5
Tests: 459
Models: 1
Passed: 242
Failed: 217
Pass rate: 52.7%
Duration: 6299.0s
Categories: 9
Speed (avg): 59.0 t/s
Speed (TTFT): 995ms
Cost/strict: $0.0005
Strict success: 100.0%
Score/$: 164284.58
Failed spend: $0.00
P50 task cost: $0.0005
P90 task cost: $0.0009
Est. cost (run): $0.82
Tokens (Σ results): 28.9k / 221.9k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 10:42 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: deepseek/deepseek-v4-pro
Total tests: 459
Categories: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full category coverage, no limiting (--limit not set), not fail-fast
Results: Truncated (only first few test outcomes shown)

Performance

Pass rate: 242/459 (52.7%)
Score: 2315.43 out of 3137 (73.8%)
Latency:
- Average: 13.50s
- Median: 8.91s
- Average TTFT (Time to First Token): 1.34s
Output speed: 48.92 tokens/sec on average
Total cost: $0.816
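
The two headline percentages follow directly from the counts above. A minimal sketch of that arithmetic, assuming one-decimal rounding (the variable and function names are illustrative and not part of BLXBench):

```python
# Figures copied from the run summary above.
passed, total_tests = 242, 459
score_points, max_points = 2315.43, 3137


def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pass_rate = percent(passed, total_tests)       # 52.7
score_pct = percent(score_points, max_points)  # 73.8
failed = total_tests - passed                  # 217

print(f"pass rate: {pass_rate}%  score: {score_pct}%  failed: {failed}")
```

The results match the reported 52.7% pass rate, 73.8% score, and 217 failed tests.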

Category-level Patterns

Strong Performance

Coding: 91.7% pass rate (55/60), score 96.2%. Strongest category by score, with a 100% pass rate on the easy and hard subcategories.
Speed: 88.3% pass rate (53/60), score 91.7%. Passes 90% of both easy and hard tasks.
Cost: 100% pass rate (30/30), score 90%. Every cost-related task passes.
UI: 88.9% pass rate (8/9), score 82.6%. Strong overall on a small sample, with 5/6 (83.3%) on medium difficulty.

Moderate to Weak Performance

Debugging: 50% pass rate (30/60), score 76.2%. Passes only about half the tasks on easy (50%) and hard (45%).
Hallucination: 50% pass rate (30/60), score 75.8%. Inconsistent: 25% on easy versus 65% on medium.
Reasoning: 30% pass rate (18/60), score 73.2%. Weak at every level: 35% (easy), 25% (medium), 30% (hard).
Refactoring: 15% pass rate (9/60), score 61.9%. Tied with security for the lowest pass rate, with only 5% on hard tasks.
Security: 15% pass rate (9/60), score 60.4%. Lowest score of any category, with 10% on easy and 20% on hard.
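
The per-category counts can be cross-checked against the run totals and ranked by pass rate. A small sketch using only the pass/total pairs quoted above (the dictionary layout is an illustration and not the structure of any BLXBench file):

```python
# (passed, total) per category, copied from the summary above.
categories = {
    "coding": (55, 60),
    "speed": (53, 60),
    "cost": (30, 30),
    "debugging": (30, 60),
    "hallucination": (30, 60),
    "reasoning": (18, 60),
    "refactoring": (9, 60),
    "security": (9, 60),
    "ui": (8, 9),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# Rank categories from highest to lowest pass rate (ties keep listing order).
ranked = sorted(
    categories,
    key=lambda name: categories[name][0] / categories[name][1],
    reverse=True,
)

for name in ranked:
    p, t = categories[name]
    print(f"{name:<14}{p:>3}/{t:<3} {100 * p / t:5.1f}%")
print(f"total: {total_passed}/{total_tests}")
```

The sums come to 242 passed out of 459 tests, which agrees with the run-level figures.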

Notable Failures and Observations

Critical errors in debugging:
- Two tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-prototype-merge-v2) failed with the error "Spread syntax requires ...iterable[Symbol.iterator] to be a function". This is the TypeError a JavaScript engine raises when code spreads a non-iterable value, which suggests a flaw in the generated code.
High-latency and high-cost tests:
- Several debugging and reasoning tests exceeded 30s latency (e.g., debug-cache-invalidation-race-v2 at 37.8s, debug-db-transaction-isolation-v2 at 40.8s).
- Highest token count: 1417 tokens in coding-hard-scheduler, costing $0.00469.
Lowest pass rates:
- refactoring::hard: 5% pass (1/20)
- security::easy: 10% pass (2/20)
- hallucination::easy: 25% pass (5/20)
High output speed outliers:
- cost-generation-title-case: 100.5 tokens/sec
- debug-rate-limiter-v2: 84.4 tokens/sec
High TTFT outliers:
- coding-medium-circular-buffer: 4.02s TTFT
- reason-constraint-maintenance-window: 8.90s TTFT, far above the 1.34s run average.

Overall, the model is strong in coding, speed, cost, and UI but shows significant weaknesses in reasoning, refactoring, and security, and two debugging tasks involving prototype pollution failed with runtime errors.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                     Tests  Pass     Score  Latency  Cost
deepseek/deepseek-v4-pro  459    242/459  73.8%  13.50s   $0.82

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.
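
The page does not say how TTFT is normalized or inverted. One common choice is min-max scaling, flipped so that a lower TTFT gives a higher value. The sketch below assumes that approach and applies it to the TTFT figures quoted in the run summary rather than to the per-category values behind the chart; `min_max` is a hypothetical helper.

```python
def min_max(values: dict[str, float], invert: bool = False) -> dict[str, float]:
    """Scale values to [0, 1]; with invert=True the smallest input maps to 1."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # All inputs are equal, so give each the same top value.
        return {key: 1.0 for key in values}
    scaled = {key: (value - lo) / span for key, value in values.items()}
    if invert:
        return {key: 1.0 - s for key, s in scaled.items()}
    return scaled


# TTFT in seconds, as quoted in the run summary (run average and two outliers).
ttft = {
    "run average": 1.34,
    "coding-medium-circular-buffer": 4.02,
    "reason-constraint-maintenance-window": 8.90,
}
ttft_score = min_max(ttft, invert=True)

for name, value in ttft_score.items():
    print(f"{name}: {value:.2f}")
```

Under this assumption the fastest entry plots at the outer edge of the radar and the slowest at the center, so a larger area always means better streaming behavior.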

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results, grouped by category (collapsed by default), then by difficulty. No HTML or screenshots appear in this table.

COMPL: from details when present.
Visual: column omitted when no test in this run has a details.visual score.
Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores).
Cost: per-task USD from cost_usd or usage.cost when recorded.
Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
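
The grouping and cost fallback described above can be sketched as follows. The field names `cost_usd` and `usage.cost` come from the description. The row keys `id`, `category`, and `difficulty` and the sample rows are assumptions made for illustration, since the actual report.json schema is not shown here.

```python
from collections import defaultdict
from typing import Optional


def task_cost(row: dict) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(rows: list[dict]) -> dict:
    """Group rows by category, then difficulty, keeping execution order."""
    grouped: dict = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical rows for illustration only; not taken from this run.
rows = [
    {"id": "task-a", "category": "coding", "difficulty": "easy", "cost_usd": 0.0004},
    {"id": "task-b", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.0011}},
    {"id": "task-c", "category": "ui", "difficulty": "medium"},
]

grouped = group_rows(rows)
costs = [task_cost(row) for row in rows]
print(costs)
```

Because the rows are appended in input order, each category/difficulty bucket preserves benchmark execution order, as the table does. A row with no recorded cost yields `None` instead of a guessed value.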