BLXBench - Run run

Benchmark run

Started May 28, 2026, 5:10 PM · Recorded May 28, 2026, 6:31 PM · Ended May 28, 2026, 6:31 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 75.0 · Tests: 459 · Models: 1

Passed: 276

Failed: 183

Pass rate: 60.1%

Duration: 4894.2s

Categories: 9

Models: 1

Speed avg: 101.1 t/s

Speed TTFT: 854ms

Cost/strict: $0.0065

Strict success: 96.7%

Score/$: 14782.87

Failed spend: $0.02

P50 task cost: $0.0042

P90 task cost: $0.01

Est. cost (run): $9.39

Tokens (Σ results): 46.0k / 370.3k

Submitted by: Bitslix

Run summary

Generated May 28, 2026, 6:32 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: anthropic/claude-opus-4.8
Total tests: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Run mode: Full category coverage (no --category, --level, or --limit applied)
Fail-fast: false — all tests executed regardless of failures

Performance Summary

Overall pass rate: 276/459 (60.1%)
Score: 2200.61 out of 3155 (69.7%)
Average latency: 10.18s
Median latency: 8.32s
Average TTFT (Time to First Token): 1.66s
Average output speed: 187.13 tok/s
Total cost: $9.39

Category-Level Performance

High-Performing Categories

Coding: 58/60 passed (96.7%), score 257/261 (98.5%) — excelled in all sub-levels, achieving 100% pass rate on easy and hard.
Cost: 29/30 passed (96.7%), score 139/150 (92.7%) — strong on efficiency and correctness.
Security: 46/60 passed (76.7%), score 413/512 (80.7%) — solid performance across difficulty levels.
Speed: 48/60 passed (80.0%), score 152/180 (84.4%) — high pass rate with good throughput.
Hallucination: 42/60 passed (70.0%), score 303/360 (84.2%) — notably strong on hard (80% pass) and medium (85% pass).
UI: 8/9 passed (88.9%), score 7.61/9 (84.5%) — only one failure on a hard test.

Lower-Performing Categories

Reasoning: 24/60 passed (40.0%), score 430/541 (79.5%) — struggles despite high output speed (792.49 tok/s), especially on easy and medium levels.
Refactoring: 8/60 passed (13.3%), score 251/541 (46.4%) — weakest category; poor performance across all difficulty levels.
Debugging: 13/60 passed (21.7%), score 248/601 (41.3%) — very low pass rate despite moderate cost ($1.20) and latency.
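
The run-level totals in the Performance Summary can be cross-checked against the per-category figures listed above. The following is a minimal sketch, not part of BLXBench: the `CATEGORIES` table and the `aggregate` helper are hypothetical names, and the numbers are copied by hand from this summary rather than read from report.json.

```python
# Per-category (passed, total, score, max_score), copied from the summary above.
# CATEGORIES and aggregate() are illustrative names, not BLXBench identifiers.
CATEGORIES = {
    "coding": (58, 60, 257.0, 261.0),
    "cost": (29, 30, 139.0, 150.0),
    "security": (46, 60, 413.0, 512.0),
    "speed": (48, 60, 152.0, 180.0),
    "hallucination": (42, 60, 303.0, 360.0),
    "ui": (8, 9, 7.61, 9.0),
    "reasoning": (24, 60, 430.0, 541.0),
    "refactoring": (8, 60, 251.0, 541.0),
    "debugging": (13, 60, 248.0, 601.0),
}


def aggregate(categories):
    """Sum per-category counts and scores into run-level totals."""
    passed = sum(row[0] for row in categories.values())
    total = sum(row[1] for row in categories.values())
    score = sum(row[2] for row in categories.values())
    max_score = sum(row[3] for row in categories.values())
    return {
        "passed": passed,
        "total": total,
        "pass_rate": round(100 * passed / total, 1),
        "score": round(score, 2),
        "max_score": max_score,
        "score_pct": round(100 * score / max_score, 1),
    }


totals = aggregate(CATEGORIES)
print(totals)
```

The sums reproduce the reported 276/459 passes (60.1%) and 2200.61 out of 3155 points (69.7%). The score percentage sits above the pass rate, which is consistent with failed tests still earning partial score, although this summary does not state the scoring rule.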

Notable Observations

Latency outliers:
- Reasoning had a high average TTFT (4.98s, against a run average of 1.66s), possibly because of complex constraint reasoning.
Cost distribution:
- Refactoring was the most expensive category ($2.54) despite its low pass rate (13.3%), which suggests verbose or inefficient outputs.
- Debugging cost $1.20 and coding $0.66.
- The cost category itself was the cheapest ($0.19), as expected.
Output speed variation:
- Reasoning averaged 792.49 tok/s, far above the run average of 187.13 tok/s, possibly because of long-form explanations.
- Security was the slowest category (74.84 tok/s), possibly because of cautious or detailed responses.
Failures in debugging:
- The model failed most debugging tests (47 of 60), especially medium and hard cases involving race conditions, closures, and async issues.
- It passed only db-transaction-isolation-v2, memoize-reference-equality-v2, object-mutation-v2, rate-limiter-v2, and recursion-base-case-v2.

Conclusion

anthropic/claude-opus-4.8 shows strong coding, cost-efficiency, and hallucination resistance, but struggles significantly with debugging and refactoring tasks, and has inconsistent reasoning accuracy despite fast output generation. The model is reliable for implementation and validation but less so for diagnosing or restructuring complex code.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                      Tests  Pass     Score  Latency  Cost
anthropic/claude-opus-4.8  459    276/459  69.7%  10.18s   $9.39

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
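
The table description above says per-task cost comes from `cost_usd` or `usage.cost` when recorded. A minimal sketch of that lookup follows, under stated assumptions: the `task_cost` helper, the row layout, and the sample rows are hypothetical, and preferring `cost_usd` over `usage.cost` is a guess, since the page does not say which field wins when both are present.

```python
from typing import Any, Dict, Optional


def task_cost(row: Dict[str, Any]) -> Optional[float]:
    """Return the per-task cost in USD, or None when no cost was recorded.

    Assumption: cost_usd takes precedence over usage.cost.
    """
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


# Hypothetical rows; the IDs and amounts are made up for illustration
# and do not come from this run's report.json.
rows = [
    {"id": "example-a", "cost_usd": 0.004},
    {"id": "example-b", "usage": {"cost": 0.01}},
    {"id": "example-c"},
]

costs = [task_cost(row) for row in rows]
print(costs)
```

Returning `None` instead of 0.0 keeps tasks with no recorded cost distinguishable from tasks that were free, which matters when averaging cost per category.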