BLXBench - Run run

Benchmark run

Started May 11, 2026, 12:08 AM · Recorded May 11, 2026, 1:20 AM · Ended May 11, 2026, 1:20 AM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 73.1 · Tests: 459 · Models: 1

Passed: 276
Failed: 183
Pass rate: 60.1%
Duration: 4348.3s
Categories: 9
Models: 1
Speed avg: 101.8 t/s
Speed TTFT: 826ms
Cost/strict: $0.0056
Strict success: 96.7%
Score/$: 16885.86
Failed spend: $0.0030
P50 task cost: $0.0040
P90 task cost: $0.01
Est. cost (run): $8.89
Tokens (Σ results): 46.5k / 350.0k
Submitted by: Bitslix

Run summary

Generated May 11, 2026, 1:21 AM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: anthropic/claude-opus-4.7
Total tests run: 456
Categories covered: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --fail-fast triggered)
Results completeness: Partially truncated (results_truncated: true), but summary aggregates are complete.

Performance Summary

Overall pass rate: 276/456 (60.53%)
Overall score: 2134.22 / 3134 (68.10%)
Total cost: $8.89
Average latency: 10.01s
Median latency: 8.60s
Average time to first token (TTFT): 0.98s
Average output speed: 95.14 tok/s
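
The two headline percentages follow from the counts listed above. A minimal sketch of the arithmetic, assuming pass rate is passed tests divided by tests run and score percentage is points earned divided by points available (the helper name `pct` is illustrative and not part of blxbench):

```python
def pct(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage, rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


pass_rate = pct(276, 456)        # passed tests / tests run
score_pct = pct(2134.22, 3134)   # points earned / points available
all_tasks_rate = pct(276, 459)   # same pass count over all 459 listed tasks

print(pass_rate, score_pct, all_tasks_rate)
```

Dividing the same 276 passes by all 459 listed tasks gives roughly 60.1%, which may explain why the pass-rate tile at the top of the page differs from the 60.53% reported here; the page itself does not state which denominator each figure uses.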

Category-Level Performance

High-Performing Categories

Coding: Exceptionally strong with 96.67% pass rate and 98.47% score. Performed well across all difficulty levels, achieving 100% on coding::easy.
Security: 81.67% pass rate, 83.59% score. Balanced performance across difficulty tiers.
Speed: 81.67% pass rate, 87.78% score. Strong output speed (101.76 tok/s) and low TTFT (0.83s).
Cost: 96.67% pass rate, 91.33% score. Nearly perfect on medium and hard subcategories.

Low-Performing Categories

Refactoring: Very low 13.33% pass rate and 36.78% score. Struggled across all difficulties, especially hard (5% pass rate).
Debugging: 38.33% pass rate, 45.11% score. Pass rate was uneven across difficulty rather than steadily declining: 60% (easy), 10% (medium), 45% (hard).
Reasoning: 31.67% pass rate, 75.60% score. Despite moderate score percentage, actual pass rate is low. Performance inconsistent across constraints.
Hallucination: 58.33% pass rate, 77.22% score. Mixed results; better on hard (75% pass) than easy (30% pass).

UI

UI: 6/6 passed (100% pass rate), but 3 tests were skipped. Score: 86.98%.

Notable Observations

Failures and Errors

Two tests in the debugging category (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with a runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function". These appear to be model output or tooling errors rather than logical failures.
In reasoning, multiple constraint-based tests failed despite moderate scoring, indicating partial correctness but failure to fully satisfy conditions.

Cost and Latency

Highest cost category: refactoring ($2.69), due to long outputs and high token counts (107.7k completion tokens).
Most expensive single test: coding-hard-json-patch ($0.039).
Slowest category: debugging (avg latency 15.03s), likely due to complex scenarios requiring longer reasoning.
Fastest output: coding category (177.28 tok/s avg), particularly on easy tasks (298.93 tok/s).

Difficulty Trends

Coding: Performance degrades slightly with difficulty but remains high (100% easy → 95% hard).
Debugging: Severe drop from 60% (easy) to 10% (medium), then partial recovery to 45% (hard).
Hallucination: Inverse trend — 30% pass on easy, 75% on hard — suggesting easier tasks may trigger more overconfident, incorrect responses.
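
The per-tier debugging figures are consistent with the 38.33% category pass rate reported earlier if the three difficulty tiers contain the same number of tests. A small sketch under that assumption (tier sizes are not stated in this summary, and the function name is illustrative):

```python
def category_pass_rate(tier_rates: dict[str, float]) -> float:
    """Average per-tier pass rates, assuming every tier has the same number of tests."""
    return round(sum(tier_rates.values()) / len(tier_rates), 2)


# Per-tier debugging pass rates as quoted in the summary.
debugging = {"easy": 60.0, "medium": 10.0, "hard": 45.0}
debugging_rate = category_pass_rate(debugging)

print(debugging_rate)
```

If the tiers were unequal in size, a weighted average by tier test count would be needed instead of the plain mean.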

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                      Tests  Pass     Score  Latency  Cost
anthropic/claude-opus-4.7  456    276/456  68.1%  10.01s   $8.89

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.3

Resumed run

Yes — resumed from a paused session

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
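
The cost fallback and grouping described for this table can be sketched as follows. This assumes each entry in the report.json results array is a JSON object; `cost_usd` and `usage.cost` are named in the description above, while the `category`, `difficulty`, and `id` keys and the sample rows are hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: use cost_usd when recorded, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    cost = usage.get("cost")
    return float(cost) if cost is not None else None


def group_rows(rows: list[dict[str, Any]]) -> dict[str, dict[str, list[dict[str, Any]]]]:
    """Group rows by category, then difficulty, keeping input (execution) order."""
    grouped: dict = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Key names "category" and "difficulty" are assumptions, not confirmed by the report.
        grouped[row["category"]][row["difficulty"]].append(row)
    return {cat: dict(by_diff) for cat, by_diff in grouped.items()}


# Hypothetical rows for illustration only; not taken from this run.
sample_rows = [
    {"id": "a", "category": "coding", "difficulty": "easy", "cost_usd": 0.004},
    {"id": "b", "category": "coding", "difficulty": "hard", "usage": {"cost": 0.01}},
    {"id": "c", "category": "coding", "difficulty": "easy"},
]

grouped = group_rows(sample_rows)
costs = [task_cost(r) for r in sample_rows]
```

Appending to plain lists keeps rows in the order they were read, which matches the stated rule that row order within each table follows benchmark execution order.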