BLXBench - Benchmark run

Benchmark run

Started May 10, 2026, 9:53 PM · Recorded May 10, 2026, 10:37 PM · Ended May 10, 2026, 10:37 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 80.2 · Tests: 459 · Models: 1

Passed: 281
Failed: 178
Pass rate: 61.2%
Duration: 2631.9s
Categories: 9
Models: 1
Speed avg: 98.7 t/s
Speed TTFT: 650ms
Cost/strict: $0.0018
Strict success: 100.0%
Score/$: 52558.72
Failed spend: $0.00
P50 task cost: $0.0014
P90 task cost: $0.0036
Est. cost (run): $2.63
Tokens (Σ results): 29.1k / 186.1k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 10:37 PM · qwen/qwen3-235b-a22b-2507

Scope

Models tested: openai/gpt-5.3-codex
Total tests: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Run mode: Full benchmark (no --limit, no --fail-fast)
Results completeness: results_truncated: true — output was truncated; full results may contain more detail.

Performance Summary

The model openai/gpt-5.3-codex achieved an overall pass rate of 281/459 (61.22%) and a score percentage of 77.66%. Average latency was 5.24s, with a median of 3.66s. Total cost for the run was $2.63.
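The headline figures can be re-derived from the raw counts. The following is a minimal sketch that uses only the numbers reported in this summary; the variable names are illustrative and are not part of blxbench.

```python
# Figures copied from the run summary; names are illustrative only.
passed = 281
total_tests = 459
total_cost_usd = 2.63

failed = total_tests - passed
pass_rate_pct = round(100 * passed / total_tests, 2)
# Simple arithmetic mean over all tests (not a percentile).
mean_cost_per_test = total_cost_usd / total_tests

print(f"failed: {failed}")
print(f"pass rate: {pass_rate_pct}%")
print(f"mean cost per test: ${mean_cost_per_test:.4f}")
```

The mean cost per test is a different statistic from the P50 and P90 task costs shown in the run header, so the values need not match.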

By Category Highlights

Strongest categories:
- coding: 96.67% pass rate, 98.47% score — excels in code generation across all difficulty levels.
- cost: 100% pass rate, 92.67% score — highly accurate in cost-aware coding and optimization.
- speed: 85% pass rate, 89.44% score — performs well on performance-sensitive tasks.
Weakest categories:
- refactoring: 21.67% pass rate, 67.10% score — struggles with code transformation tasks, especially on easy problems.
- reasoning: 35% pass rate, 74.68% score — poor on constraint-based reasoning, particularly hard problems (20% pass rate).
- hallucination: 50% pass rate, 74.44% score — frequent factual errors in API behavior, edge cases, and documentation.
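To compare the categories side by side, the figures listed above can be sorted by pass rate. This is a rough sketch covering only the six categories for which the highlights give both a pass rate and a score; the dictionary layout and the "gap" column are assumptions made for illustration, not a blxbench output format.

```python
# (pass rate %, score %) per category, as reported in the highlights.
categories = {
    "coding": (96.67, 98.47),
    "cost": (100.0, 92.67),
    "speed": (85.0, 89.44),
    "refactoring": (21.67, 67.10),
    "reasoning": (35.0, 74.68),
    "hallucination": (50.0, 74.44),
}

# Sort ascending by pass rate; gap = score % minus pass rate %.
by_pass_rate = sorted(categories.items(), key=lambda item: item[1][0])
for name, (pass_pct, score_pct) in by_pass_rate:
    gap = score_pct - pass_pct
    print(f"{name:<14} pass {pass_pct:6.2f}%  score {score_pct:6.2f}%  gap {gap:+6.2f}")

weakest = [name for name, _ in by_pass_rate[:3]]
print("weakest by pass rate:", weakest)
```

A large positive gap, as in refactoring, means the score stays fairly high even though few tests pass outright. This may indicate partial credit on failed tests, but the summary does not say how the score is computed.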

Notable Failures and Errors

Two tests failed with execution errors, both in the debugging category:
- debug-prototype-pollution-check-v2: error "Spread syntax requires ...iterable[Symbol.iterator] to be a function"
- debug-prototype-pollution-merge-v2: same error
This error is raised when spread syntax is applied to a value that is not iterable, which suggests issues with handling prototype pollution edge cases.
High-latency outliers:
- coding-hard-json-pointer: 12.76s latency
- coding-hard-json-patch: 10.56s latency
- halluc-edge-regex-backtrack: 9.02s latency
High-cost tests:
- coding-hard-json-pointer: $0.0212
- coding-hard-json-patch: $0.0168
- debug-env-flag-string-v2: $0.0095
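One way to put these outliers in context is to express them as multiples of the run-wide reference values. The sketch below assumes the median latency (3.66s) from the performance summary and the P90 task cost ($0.0036) from the run header as baselines; choosing these baselines is an illustrative decision, not something the report itself does.

```python
median_latency_s = 3.66     # median latency from the performance summary
p90_task_cost_usd = 0.0036  # P90 task cost from the run header

slow_tests = {
    "coding-hard-json-pointer": 12.76,
    "coding-hard-json-patch": 10.56,
    "halluc-edge-regex-backtrack": 9.02,
}
costly_tests = {
    "coding-hard-json-pointer": 0.0212,
    "coding-hard-json-patch": 0.0168,
    "debug-env-flag-string-v2": 0.0095,
}

# Ratio of each outlier to its baseline, rounded to two decimals.
latency_ratio = {name: round(sec / median_latency_s, 2) for name, sec in slow_tests.items()}
cost_ratio = {name: round(usd / p90_task_cost_usd, 2) for name, usd in costly_tests.items()}

for name, ratio in latency_ratio.items():
    print(f"{name}: {ratio}x median latency")
for name, ratio in cost_ratio.items():
    print(f"{name}: {ratio}x P90 task cost")
```

The two JSON tasks appear in both lists, so the slowest tests in this run are also the most expensive ones.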

Category-Level Observations

Coding: Perfect pass rate on easy and hard tasks (100%), with a slight drop on medium (90%). Strong output speed (120.34 tok/s on easy).
Debugging: Moderate pass rate (66.67%), but performance improves with difficulty; in some subcategories hard problems have a higher pass rate than easy ones.
Hallucination: Fails consistently on API and documentation claims (e.g., halluc-api-array-flat, halluc-doc-tc39-pipeline). Struggles with edge and complexity subcategories.
Reasoning: Very low pass rate on hard problems (20%), especially constraint-based (reason-constraint-subscription-migration). High output speed (251.09 tok/s on hard) but poor accuracy.
Refactoring: Surprisingly weak on easy problems (10% pass rate), better on medium and hard — suggests misalignment with expected refactoring patterns.
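The per-difficulty coding figures can be checked against the category-level pass rate. The sketch below assumes that each difficulty tier contains the same number of tests, which the report does not state; under that assumption the category rate is the unweighted mean of the tier rates.

```python
# Coding pass rates by difficulty, as stated in the observations.
coding_by_difficulty = {"easy": 100.0, "medium": 90.0, "hard": 100.0}

# ASSUMPTION: equal test counts per tier, so an unweighted mean applies.
unweighted_mean = sum(coding_by_difficulty.values()) / len(coding_by_difficulty)
reported_category_rate = 96.67  # coding pass rate from the category highlights

print(f"unweighted mean: {unweighted_mean:.2f}%")
print(f"matches reported rate: {round(unweighted_mean, 2) == reported_category_rate}")
```

If the tiers were of unequal size, a weighted mean using the per-tier test counts would be needed instead.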

Cost and Latency

Total cost: $2.63
Total tokens: 29,117 prompt + 186,148 completion = 215,265 total
Average output speed: 107.20 tok/s
TTFT (Time to First Token): Average 1.01s, lowest in coding (0.66s), highest in reasoning (3.32s)
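The token and cost totals above allow a couple of derived figures. This is a minimal sketch using only the reported totals; the "blended cost per 1k tokens" is an illustrative ratio that mixes prompt and completion pricing, and it assumes the reported run cost is attributable to these tokens.

```python
prompt_tokens = 29_117
completion_tokens = 186_148
total_cost_usd = 2.63

total_tokens = prompt_tokens + completion_tokens
completion_share_pct = round(100 * completion_tokens / total_tokens, 1)
# Blended rate: total spend divided by all tokens, scaled to 1,000 tokens.
blended_cost_per_1k = total_cost_usd / total_tokens * 1000

print(f"total tokens: {total_tokens:,}")
print(f"completion share: {completion_share_pct}%")
print(f"blended cost per 1k tokens: ${blended_cost_per_1k:.4f}")
```

Completion tokens dominate the total, so output length is the main driver of token volume in this run.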

Although cost and latency are high on some tasks, the model demonstrates strong coding and cost-awareness capabilities. It shows significant weaknesses in reasoning and hallucination resistance.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

| Model | Tests | Pass | Score | Latency | Cost |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-5.3-codex | 459 | 281/459 | 77.7% | 5.24s | $2.63 |

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2
Resumed run: (no value recorded)

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results, grouped by category (collapsed by default), then by difficulty. No HTML or screenshots in this table.
- COMPL: from details when present.
- Visual: column omitted when no test in this run has a details.visual score.
- Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores).
- Cost: per-task USD from cost_usd or usage.cost when recorded.
- Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
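The grouping and cost fallback described for this table could be reproduced along the following lines. This is a sketch under stated assumptions: `cost_usd` and `usage.cost` are the field names mentioned above, while `id`, `category`, and `difficulty` are hypothetical field names, and the sample rows are invented for illustration rather than taken from this run.

```python
from collections import defaultdict


def task_cost(row):
    """Return per-task USD from cost_usd, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_results(rows):
    """Group rows by category, then difficulty, keeping input (execution) order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # ASSUMPTION: each row carries "category" and "difficulty" keys.
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical rows for illustration only; not data from this run.
sample_rows = [
    {"id": "task-a", "category": "coding", "difficulty": "hard", "cost_usd": 0.002},
    {"id": "task-b", "category": "coding", "difficulty": "easy", "usage": {"cost": 0.001}},
    {"id": "task-c", "category": "coding", "difficulty": "hard"},
]

grouped = group_results(sample_rows)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        ids = [r["id"] for r in rows]
        costs = [task_cost(r) for r in rows]
        print(category, difficulty, ids, costs)
```

Because rows are appended in input order, the order inside each group follows the order of the results list, which matches the "benchmark execution order" behavior described above.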