BLXBench - Run run

Benchmark run

Started May 10, 2026, 10:57 PM · Recorded May 10, 2026, 11:46 PM · Ended May 10, 2026, 11:46 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 80.9 · Tests: 459 · Models: 1

Passed: 284
Failed: 175
Pass rate: 61.9%
Duration: 2923.3s
Categories: 9
Models: 1
Speed avg: 98.7 t/s
Speed TTFT: 1005 ms
Cost/strict: $0.0039
Strict success: 96.7%
Score/$: 24052.11
Failed spend: $0.0051
P50 task cost: $0.0031
P90 task cost: $0.0074
Est. cost (run): $6.78
Tokens (Σ results): 29.1k / 223.4k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 11:46 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: openai/gpt-5.5
Total tests run: 459
Categories covered: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --fail-fast used)
Results completeness: results_truncated: true — full per-test details are not present in this payload.

Performance Summary

Overall pass rate: 284/459 (61.87%)
Overall score: 2443.87 / 3137 (77.90%)
Total cost: $6.78
Average latency: 6.06s
Median latency: 3.95s
Average time to first token (TTFT): 1.09s
Average output speed: 102.68 tok/s

Category-Level Performance

High-Performing Categories

Coding: 58/60 passed (96.67%), score 257/261 (98.47%) — strongest category.
- Perfect on hard and easy sublevels (100% score).
Speed: 52/60 passed (86.67%), score 161/180 (89.44%)
Cost: 29/30 passed (96.67%), score 137/150 (91.33%)
UI: 9/9 passed (100%), score 7.87/9 (87.49%)

Medium-Performing Categories

Security: 32/60 passed (53.33%), score 383/512 (74.80%)
Debugging: 38/60 passed (63.33%), score 449/583 (77.02%)
- Performance improves with difficulty: hard (75% pass) > medium (65%) > easy (50%).
Hallucination: 34/60 passed (56.67%), score 284/360 (78.89%)
- Mixed results across subtypes (API, behavior, edge cases).

Lower-Performing Categories

Reasoning: 20/60 passed (33.33%), score 412/541 (76.16%)
- Very low pass rate despite moderate scoring — suggests partial credit on many failures.
Refactoring: 12/60 passed (20%), score 353/541 (65.25%)
- Poor performance across all difficulty levels, especially easy (10% pass).
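
The overall pass rate and overall score in the Performance Summary can be re-derived from the per-category rows above. The following is a minimal Python sketch that uses only the figures quoted in this summary; the tuple layout and variable names are illustrative assumptions and are not part of blxbench or report.json.

```python
# Per-category figures copied from the category list in this summary:
# (category, passed, tests, score, max_score)
CATEGORIES = [
    ("coding", 58, 60, 257.0, 261.0),
    ("speed", 52, 60, 161.0, 180.0),
    ("cost", 29, 30, 137.0, 150.0),
    ("ui", 9, 9, 7.87, 9.0),
    ("security", 32, 60, 383.0, 512.0),
    ("debugging", 38, 60, 449.0, 583.0),
    ("hallucination", 34, 60, 284.0, 360.0),
    ("reasoning", 20, 60, 412.0, 541.0),
    ("refactoring", 12, 60, 353.0, 541.0),
]

# Sum each column across categories.
passed = sum(c[1] for c in CATEGORIES)
tests = sum(c[2] for c in CATEGORIES)
score = sum(c[3] for c in CATEGORIES)
max_score = sum(c[4] for c in CATEGORIES)

# Overall percentages are ratios of the summed columns,
# not averages of the per-category percentages.
pass_rate = 100 * passed / tests
score_pct = 100 * score / max_score

print(f"{passed}/{tests} passed ({pass_rate:.2f}%)")
print(f"score {score:.2f} / {max_score:.0f} ({score_pct:.2f}%)")
```

Run as written, this reproduces the 284/459 (61.87%) pass rate and the 2443.87 / 3137 (77.90%) score reported above, which suggests the overall numbers are test-weighted sums rather than a mean of category percentages.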

Notable Observations

High cost contributors:
- refactoring was the most expensive category: $1.62 (about $0.027 per test over 60 tests).
- ui cost was surprisingly high: $1.48 across only 9 tests (about $0.16 per test), driven by large completion tokens (49.5k).
Latency outliers:
- Highest average TTFT in reasoning (2.48s), likely due to complex constraint analysis.
- refactoring, security, and debugging have the lowest TTFT (~0.78s), suggesting fast starts but not always correct completions.
Output speed variation:
- Fastest: reasoning (148.47 tok/s)
- Slowest: security (67.55 tok/s)
Failures with errors:
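- Two debugging tests (prototype-pollution-check-v2, prototype-pollution-merge-v2) failed with the runtime error "Spread syntax requires ...iterable[Symbol.iterator] to be a function". This is a JavaScript TypeError raised when a value that is not iterable is spread, so the generated code most likely spread a non-iterable value rather than emitting syntactically invalid code.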
Scoring anomaly:
- In debugging::nullish-cache-hit-v2, the model scored 7/10 but was recorded as passed: false, which indicates that partial-credit scoring is in use.
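
The gap between pass rate and score percentage (for example in reasoning and refactoring) follows from partial credit: a failed test still contributes the points it earned. The sketch below illustrates this in Python. The `Result` class, the `summarize` helper, and the second row are hypothetical and exist only for illustration; the report does not state the actual pass rule, so the `passed` flag is taken as given rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Result:
    test_id: str
    score: float
    max_score: float
    passed: bool


def summarize(results):
    """Return (pass_rate_pct, score_pct) for a list of Result rows."""
    pass_rate = 100 * sum(r.passed for r in results) / len(results)
    score_pct = 100 * sum(r.score for r in results) / sum(r.max_score for r in results)
    return pass_rate, score_pct


rows = [
    # Taken from the report: 7/10 points, but recorded as not passed.
    Result("debugging::nullish-cache-hit-v2", 7, 10, False),
    # Hypothetical row, added only so the two rates can be compared.
    Result("hypothetical::full-marks", 10, 10, True),
]

pass_rate, score_pct = summarize(rows)
print(f"pass rate {pass_rate:.1f}%, score {score_pct:.1f}%")
# prints: pass rate 50.0%, score 85.0%
```

With these two rows the pass rate is 50% while the score is 85%, the same pattern seen at category level where a low pass rate coexists with a moderate score.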

Conclusion

openai/gpt-5.5 excels in coding and cost tasks (96.67% pass rate each) and performs well on speed tasks (86.67%). It struggles significantly with reasoning under constraints (33.33% pass rate) and code refactoring (20%), and shows inconsistent behavior on hallucination avoidance (56.67%). The high cost in ui and refactoring suggests verbose or inefficient outputs in those domains.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model            Tests  Pass     Score  Latency  Cost
openai/gpt-5.5   459    284/459  77.9%  6.06s    $6.78

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.3

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)