BLXBench - GPT-5.5 Pro

Model detail

openai/gpt-5.5-pro

Overall score: 44.5 · Overall rank: n/a · Benchmark runs: 2 · Suite: v2 — Resilience

Score: 44.5

Pass rate: 44.5%

Tests: 163/366

Runs: 2

Avg latency: 11.55 s

TTFT (Ø): 12517 ms

Decode (Ø): 2215.0 tok/s

Leading categories: 1. Security (76.7%)

Est. cost: $7.88

Tokens (Σ): 16.2k prompt / 41.5k completion

Model summary

Generated May 8, 2026, 12:27 AM · openrouter/owl-alpha

Snapshot

Model: openai/gpt-5.5-pro
Rank: 20/27, score 22%, pass rate 22% (163/732)
Latency/Speed: TTFT 12.5 s, output speed 2215 tok/s
Cost/Tokens: $7.88, prompt tokens 16190, completion tokens 41548

Strengths

Security category leads with 46/120 passed (38%), the strongest area.
Debugging shows moderate reliability at 41/120 (34%).
Speed category contributes 32/130 passes, indicating reasonable responsiveness.

Weaknesses

Hallucination detection is critically weak: only 2/120 passed (2%).
Reasoning and refactoring both underperform at ~18% pass rate.
Overall pass rate of 22% places the model in the bottom third of the leaderboard (rank 20/27).

Top-3 comparison

Trails top model x-ai/grok-4.3 by 63 percentage points in score (85% vs 22%).
Pass rate gap is severe: 86% (Grok) vs 22% (GPT-5.5 Pro).
Even third-place openai/gpt-chat-latest scores 84%, far ahead.

Recommendation

This model is unsuitable for production coding tasks requiring reliability. Its severe hallucination failures and low reasoning scores pose high deployment risk. Consider only for non-critical speed-focused tasks where security is not paramount.

Score over runs

Overall score % from merged run_models rows (chronological). Only runs that include this model appear as points.

Category performance

Score % vs pass rate % per category. With 0/1 scorers, both usually line up; with proportional tests, score % reflects partial credit while pass rate counts tests that clear the fixture threshold.
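
The distinction between score % and pass rate % can be sketched as follows. This is a minimal illustration, not BLXBench's actual scoring code: the `Result` type, the per-test `threshold` field, and the sample values are all hypothetical, and the only assumption taken from the text is that a test "passes" when its score clears a fixture threshold.

```python
from dataclasses import dataclass


@dataclass
class Result:
    score: float      # credit earned on one test, between 0.0 and 1.0
    threshold: float  # hypothetical fixture threshold needed to count as a pass


def score_pct(results):
    """Mean credit across tests, so partial credit is included."""
    return 100.0 * sum(r.score for r in results) / len(results)


def pass_rate_pct(results):
    """Share of tests whose score reaches the fixture threshold."""
    passed = sum(1 for r in results if r.score >= r.threshold)
    return 100.0 * passed / len(results)


# 0/1 scorer: every test earns either full credit or none.
binary = [Result(1.0, 1.0), Result(0.0, 1.0), Result(1.0, 1.0), Result(0.0, 1.0)]

# Proportional scorer: partial credit, with an assumed threshold of 0.8.
partial = [Result(0.9, 0.8), Result(0.5, 0.8), Result(0.7, 0.8), Result(0.1, 0.8)]

for name, results in (("binary", binary), ("partial", partial)):
    print(f"{name}: score {score_pct(results):.1f}%, pass rate {pass_rate_pct(results):.1f}%")
```

With the 0/1 sample both metrics agree, while the proportional sample yields a higher score % than pass rate % because three of its four tests earn credit without clearing the threshold.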

Cost by category

Total estimated spend per scope for this model (bars, left axis) and mean spend per merged result row (line, right axis: total ÷ tests).
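
The "total ÷ tests" line can be reproduced roughly as below. This sketch assumes the divisor is the per-category test count shown in the category table on this page; the page does not say whether merged result rows are counted per run, so the real chart values may differ. The function name is illustrative.

```python
# (total cost in USD, number of tests) per category, taken from this page's table.
category_cost = {
    "Coding UI": (0.00, 0),
    "Debugging": (1.34, 60),
    "Hallucination": (0.35, 60),
    "Reasoning": (0.33, 61),
    "Refactoring": (2.38, 60),
    "Roblox": (0.00, 0),
    "Security": (1.63, 60),
    "Speed": (1.86, 65),
}


def mean_cost_per_test(total, tests):
    """Return total / tests, or None when the category has no tests."""
    return total / tests if tests else None


for name, (total, tests) in category_cost.items():
    mean = mean_cost_per_test(total, tests)
    shown = "n/a" if mean is None else f"${mean:.4f}"
    print(f"{name:<14} total ${total:.2f}  mean per test {shown}")
```

Categories with no tests (Coding UI, Roblox) are reported as n/a rather than dividing by zero.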

Pass rate by difficulty

Pass rate % per difficulty level — complements the score % view above.

Speed profile by category

Normalized 0–100 within this model: TTFT (shorter → higher spoke) and decode tok/s (higher → higher spoke). Values come from streamed BLXBench runs merged into overall_ranking.json.
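
One plausible way to obtain such spokes is min-max scaling across the model's own categories, inverted for TTFT. The page does not state the exact formula, so treat this as an assumption. The tok/s values come from the category table on this page; the TTFT values are hypothetical placeholders, since per-category latency is listed as n/a.

```python
def normalize(values, higher_is_better=True):
    """Min-max scale values to 0-100; invert when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All categories identical: give every spoke the full score.
        return [100.0 for _ in values]
    scaled = [100.0 * (v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [100.0 - s for s in scaled]


categories = ["Debugging", "Hallucination", "Reasoning", "Refactoring", "Security", "Speed"]
decode_tok_s = [1200.0, 16.3, 218.0, 1077.3, 7408.0, 2608.4]  # from the category table
ttft_ms = [9000.0, 15000.0, 14000.0, 12000.0, 8000.0, 11000.0]  # hypothetical values

decode_spokes = normalize(decode_tok_s, higher_is_better=True)
ttft_spokes = normalize(ttft_ms, higher_is_better=False)

for name, d, t in zip(categories, decode_spokes, ttft_spokes):
    print(f"{name:<14} decode {d:5.1f}  ttft {t:5.1f}")
```

Because scaling happens within one model, the spokes show which categories are relatively fast or slow for this model and are not comparable across models.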

Pass rate by category

Pass rate % per category for this model (distinct from score %, which reflects partial credit).

Category       Rank   Pass   Score  Latency  tok/s    Cost
Coding UI      n/a    0/0    n/a    n/a      n/a      $0.00
Debugging      16/27  41/60  68.3   n/a      1200.0   $1.34
Hallucination  21/27  2/60   3.3    n/a      16.3     $0.35
Reasoning      15/27  22/61  36.1   n/a      218.0    $0.33
Refactoring    18/27  20/60  33.3   n/a      1077.3   $2.38
Roblox         n/a    0/0    n/a    n/a      n/a      $0.00
Security       13/27  46/60  76.7   n/a      7408.0   $1.63
Speed          7/27   32/65  49.2   n/a      2608.4   $1.86
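
As a quick consistency check, the per-category rows can be summed and compared with the headline figures. The sketch below only does arithmetic on numbers shown on this page. It also evaluates 163 against the 732-test denominator used in the generated summary; the page does not explain why the two denominators differ.

```python
# (passed, total, cost in USD) per category, copied from the table above.
categories = {
    "Debugging": (41, 60, 1.34),
    "Hallucination": (2, 60, 0.35),
    "Reasoning": (22, 61, 0.33),
    "Refactoring": (20, 60, 2.38),
    "Security": (46, 60, 1.63),
    "Speed": (32, 65, 1.86),
}

passed = sum(p for p, _, _ in categories.values())
total = sum(t for _, t, _ in categories.values())
cost = sum(c for _, _, c in categories.values())


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


print(f"passed {passed}/{total} = {pct(passed, total)}%")
print(f"against 732 tests: {pct(passed, 732)}%")
print(f"summed category cost: ${cost:.2f}")
```

The category rows add up to 163/366 (44.5%), matching the headline score, while 163/732 gives about 22.3%, matching the summary's 22%. The summed cost comes to $7.89, one cent above the listed $7.88, which is consistent with per-category rounding.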

#  Run         Suite           By  Tests  Run Σ   This model

v1 — Nutrition · Cost rank within suite · 2 runs

   run_ae5dad  v1 — Nutrition      732    $0.23   $0.00
   run_c46c08  v1 — Nutrition      732    $8.96   $7.88

Categories: 8 scopes

Cost per run: 2 for this model · 61 in overall_ranking · grouped by suite
