BLXBench - Run run

Benchmark run

Started Jun 9, 2026, 6:36 PM · Recorded Jun 9, 2026, 8:08 PM · Ended Jun 9, 2026, 8:08 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 70.3
Tests: 459
Models: 1
Passed: 259
Failed: 200
Pass rate: 56.4%
Duration: 5502.1 s
Categories: 9
Speed avg: 103.0 t/s
Speed TTFT: 3519 ms
Cost/strict: $0.02
Strict success: 96.7%
Score/$: 5734.03
Failed spend: $0.05
P50 task cost: $0.01
P90 task cost: $0.04
Est. cost (run): $18.57
Tokens (Σ results): 46.0k / 366.0k
Submitted by: Bitslix

Run summary

Generated Jun 9, 2026, 8:09 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: anthropic/claude-fable-5
Total tests: 459
Categories covered: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --category filter), though results are truncated ("results_truncated": true)
Fail-fast mode: Disabled (fail_fast: false)

Performance Summary

Overall pass rate: 259/459 (56.4%)
Score: 2033.76 / 3155 (64.5%)
Average latency: 11.5s
Median latency: 9.6s
Average TTFT (Time to First Token): 4.03s
Average output speed: 187.1 tok/s
Total cost: $18.57
Total tokens: 45,972 prompt + 365,991 completion = 411,963 total
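
The headline percentages and the token total can be rechecked from the raw figures in this summary. The following is a minimal sketch in Python using only the numbers listed above; the mean cost per test is derived here and is not a value the benchmark reports.

```python
# Figures copied from the Performance Summary above.
passed, total_tests = 259, 459
score, max_score = 2033.76, 3155
prompt_tokens, completion_tokens = 45_972, 365_991
total_cost_usd = 18.57

pass_rate = passed / total_tests
score_pct = score / max_score
total_tokens = prompt_tokens + completion_tokens
# Derived here, not reported by the benchmark: mean cost per test.
mean_cost_per_test = total_cost_usd / total_tests

print(f"pass rate: {pass_rate:.1%}")          # 56.4%
print(f"score: {score_pct:.1%}")              # 64.5%
print(f"total tokens: {total_tokens:,}")      # 411,963
print(f"mean cost per test: ${mean_cost_per_test:.4f}")
```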

Category-Level Performance

High-Performing Categories

Coding: Exceptional performance with 58/60 passed (96.7% pass rate) and 98.5% score. Strong across all difficulty levels, achieving 100% on easy and hard subcategories.
Cost: Also strong at 29/30 passed (96.7% pass rate) and 88% score. Perfect on medium and easy levels.
Speed: 48/60 passed (80% pass rate), 85.6% score. Performance drops slightly on hard tasks (75% pass rate).
UI: 8/9 passed (88.9% pass rate), 86.2% score. Only one failure on a hard test.

Low-Performing Categories

Refactoring: Very weak at 6/60 passed (10% pass rate), 42.5% score. Performance is consistently poor across all difficulties, worst on hard (5% pass rate).
Security: 21/60 passed (35% pass rate), 40.6% score. Struggles across the board: even easy tasks reach only a 35% pass rate.
Reasoning: 18/60 passed (30% pass rate), 75.4% score. Despite low pass rate, scores higher due to partial credit on complex constraint-based tasks.
Debugging: 28/60 passed (46.7% pass rate), 58.2% score. Performance degrades with difficulty: 75% (easy), 40% (medium), 25% (hard).

Hallucination

Pass rate: 43/60 (71.7%)
Score: 79.7%
Mixed results: excels on hard behavioral/API claims (100% on some), but fails on edge cases like halluc-edge-rate-limiter and halluc-bug-label-statement.
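
As a consistency check, the per-category pass counts quoted above can be summed and compared with the overall 259/459 figure. This is a small sketch, assuming the passed/total pairs are transcribed correctly from the category breakdown; the dictionary layout is illustrative and is not the structure of report.json.

```python
# (passed, total) per category, copied from the category breakdown above.
categories = {
    "coding": (58, 60),
    "cost": (29, 30),
    "speed": (48, 60),
    "ui": (8, 9),
    "refactoring": (6, 60),
    "security": (21, 60),
    "reasoning": (18, 60),
    "debugging": (28, 60),
    "hallucination": (43, 60),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())

# List categories from lowest to highest pass rate.
by_rate = sorted(categories.items(), key=lambda kv: kv[1][0] / kv[1][1])
for name, (p, t) in by_rate:
    print(f"{name:<14}{p:>3}/{t:<3} {p / t:6.1%}")

print(f"overall: {total_passed}/{total_tests} ({total_passed / total_tests:.1%})")
```

The category counts add up to the reported totals, so the category-level and overall pass rates agree with each other.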

Notable Observations

Cost-heavy tasks: Refactoring incurred the highest cost ($5.18) despite lowest pass rate, due to long outputs (103,790 completion tokens).
Latency outliers: Some reasoning and coding hard tasks exceeded 20s latency (e.g., coding-hard-diff-objects at 20.39s).
Output speed variance: Extremely high on some reasoning tasks (e.g., 19848 tok/s on reason-constraint-rollout-window) due to rapid final bursts.
TTFT consistency: Generally stable around 3–5s across categories, with reasoning having the highest average (5.03s).
Failures under complexity: Model struggles with concurrency, race conditions, and state management bugs (e.g., debug-concurrent-map-delete-v2, debug-microtask-race-v2 failed).
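
To put the refactoring cost observation in proportion, the sketch below derives that category's share of run cost and of completion tokens, plus its cost per passed test. The inputs come from this report; the three derived values are computed here and do not appear in the report.

```python
# Figures copied from the report; shares are derived here.
refactoring_cost_usd = 5.18
refactoring_completion_tokens = 103_790
refactoring_passed = 6
run_cost_usd = 18.57
run_completion_tokens = 365_991

cost_share = refactoring_cost_usd / run_cost_usd
token_share = refactoring_completion_tokens / run_completion_tokens
cost_per_pass = refactoring_cost_usd / refactoring_passed

print(f"share of run cost: {cost_share:.1%}")            # 27.9%
print(f"share of completion tokens: {token_share:.1%}")  # 28.4%
print(f"cost per passed test: ${cost_per_pass:.2f}")     # $0.86
```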

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                     Tests  Pass     Score  Latency  Cost
anthropic/claude-fable-5  459    259/459  64.5%  11.50s   $18.57

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
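
The cost fallback described above (cost_usd, else usage.cost) could be read from a result row roughly as follows. This is a hedged sketch, not the dashboard's actual code: it assumes each row is a JSON object, that usage is a nested object, and that cost_usd takes precedence when both are present. The helper name task_cost and the sample rows are hypothetical.

```python
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task USD cost, or None when no cost was recorded."""
    # Assumption: cost_usd wins over usage.cost when both exist.
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


# Hypothetical rows for illustration only; not taken from this run.
rows = [
    {"id": "example-a", "cost_usd": 0.01},
    {"id": "example-b", "usage": {"cost": 0.04}},
    {"id": "example-c"},
]
costs = [task_cost(r) for r in rows]
print(costs)  # [0.01, 0.04, None]
```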

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)