BLXBench - Granite 4.1 8b

Model detail

ibm-granite/granite-4.1-8b

Overall score: 67.1 · Overall rank: 22/29 · Benchmark runs: 1 · Suite: v2 — Resilience

Score: 67.1
Pass rate: 43.4
Tests: 199/459
Runs: 1
Avg latency: 2.89s
TTFT (Ø): 284 ms
Decode (Ø): 109.8 tok/s
Leading categories: 1 · Speed · 97.2%
Est. cost: $0.01
Tokens (Σ): 30.7k pr / 129.2k comp

Model summary

Generated May 10, 2026, 7:30 PM · openrouter/owl-alpha

Snapshot

Model: ibm-granite/granite-4.1-8b
Rank: 13/27 v1 — Nutrition · score 64.56% · pass rate 64.88% (242/373); 3/4 v2 — Resilience · score 67.07% · pass rate 43.36% (199/459)
Latency/Speed: TTFT 0.27s, output speed 127.4 tok/s
Cost/Tokens: $0.018, prompt tokens 51979, completion tokens 160287

Strengths

Strong cost efficiency: $0.018 total across 832 tests, with cost category passing 29/30 (96.7%).
Solid speed performance: 102/125 passed in the speed category, with fast TTFT (0.27s avg).
Competitive in v2 — Resilience: ranked 3/4 with 67.07% score, trailing only deepseek/deepseek-v4-flash and x-ai/grok-4.3.

Weaknesses

Weak reasoning: only 36/122 passed (29.5%), the lowest-performing major category.
Poor security results: 53/120 passed (44.2%), well below the overall pass rate.
UI category nearly failed: 1/9 passed (11.1%), indicating limited front-end capability.
Low v2 pass rate: 43.36% (199/459).

Score over runs

Overall score % from merged run_models rows (chronological). Only runs that include this model appear as points.

Suite versions (latest run per version)

Overall score % from the most recent run per selected suite version for this model. Different suites use different fixtures and max scores — only interpret comparisons qualitatively.
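Selecting "the most recent run per suite version" can be pictured as a small reduction over run records. The sketch below is only an illustration: the `Run` class, its field names, and the sample rows are hypothetical and are not taken from BLXBench's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # All fields are hypothetical; the real run_models schema is not shown on this page.
    run_id: str
    suite: str
    finished_at: str  # ISO 8601 timestamp, so string comparison matches time order
    score_pct: float


def latest_per_suite(runs):
    """Return {suite: Run}, keeping only the most recent run of each suite."""
    latest = {}
    for run in runs:
        current = latest.get(run.suite)
        if current is None or run.finished_at > current.finished_at:
            latest[run.suite] = run
    return latest


# Made-up sample rows, for illustration only.
runs = [
    Run("run_a", "v1", "2026-01-01T00:00:00", 60.0),
    Run("run_b", "v1", "2026-02-01T00:00:00", 62.5),
    Run("run_c", "v2", "2026-03-01T00:00:00", 70.0),
]
picked = latest_per_suite(runs)
```

Because each suite uses different fixtures and max scores, the values picked this way are best shown side by side rather than subtracted from one another.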

No score samples for the selected versions.

Category performance

Score % vs pass rate % per category. With 0/1 scorers, both usually line up; with proportional tests, score % reflects partial credit while pass rate counts tests that clear the fixture threshold.
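The gap between score % and pass rate % is easier to see with a toy calculation. This is a minimal sketch under two assumptions that the page does not confirm: score % is taken as total earned points over total max points, and a test passes when its earned fraction reaches a per-test threshold. The field names and sample values are hypothetical.

```python
def score_pct(results):
    """Partial credit counts: total earned points over total max points."""
    earned = sum(r["earned"] for r in results)
    maximum = sum(r["max"] for r in results)
    return 100.0 * earned / maximum


def pass_rate_pct(results):
    """A test passes only if its earned fraction clears its threshold."""
    passed = sum(1 for r in results if r["earned"] / r["max"] >= r["threshold"])
    return 100.0 * passed / len(results)


# Hypothetical results: two 0/1-style tests and two proportional tests.
results = [
    {"earned": 1.0, "max": 1.0, "threshold": 1.0},   # 0/1 scorer, pass
    {"earned": 0.0, "max": 1.0, "threshold": 1.0},   # 0/1 scorer, fail
    {"earned": 0.75, "max": 1.0, "threshold": 0.7},  # proportional, pass
    {"earned": 0.5, "max": 1.0, "threshold": 0.7},   # proportional, fail with partial credit
]
score = score_pct(results)
pass_rate = pass_rate_pct(results)
```

In this toy set the failing proportional test still contributes half a point, so the score (56.25%) sits above the pass rate (50%). The same mechanism would explain a model scoring 67.1 while passing 43.4% of tests.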

Cost by category

Total estimated spend per scope for this model (bars, left axis) and mean spend per merged result row (line, right axis: total ÷ tests).
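The two series in this chart (total spend, and total ÷ tests) amount to a group-by over result rows. The following sketch assumes each merged result row carries a category and a cost in USD; the row layout and the sample values are hypothetical.

```python
from collections import defaultdict


def cost_by_category(rows):
    """Return {category: (total_usd, mean_usd_per_row)} from (category, cost_usd) rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, cost_usd in rows:
        totals[category] += cost_usd
        counts[category] += 1
    # Mean spend per row is the total divided by the number of rows in that category.
    return {c: (totals[c], totals[c] / counts[c]) for c in totals}


# Made-up rows, for illustration only.
rows = [("coding", 0.002), ("coding", 0.004), ("security", 0.001)]
summary = cost_by_category(rows)
```

Plotting the total on one axis and the mean on another keeps categories with few tests from looking cheap just because they are small.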

Pass rate by difficulty

Pass rate % per difficulty level — complements the score % view above.

Speed profile by category

Normalized 0–100 within this model: TTFT (shorter → higher spoke) and decode tok/s (higher → higher spoke). Values come from streamed BLXBench runs merged into overall_ranking.json.
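One common way to get a 0–100 scale "within this model" is min-max normalization across categories, inverted for metrics where lower is better. The page does not state the exact formula, so the sketch below is an assumption, including the choice to map a flat series to 100. The sample values are hypothetical.

```python
def normalize(values, higher_is_better=True):
    """Min-max scale values to 0-100; invert when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if every category is identical, treat all as best.
        return [100.0 for _ in values]
    scaled = [100.0 * (v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [100.0 - s for s in scaled]


# Hypothetical per-category measurements.
ttft_ms = [200, 300, 400]   # shorter TTFT should give a higher spoke
decode_tok_s = [100, 110, 120]  # higher decode speed should give a higher spoke

ttft_spokes = normalize(ttft_ms, higher_is_better=False)
decode_spokes = normalize(decode_tok_s, higher_is_better=True)
```

Under this scheme the spokes only rank categories against each other for one model, so a 100 here says nothing about how the model compares with others.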

Pass rate by category

Pass rate % per category for this model (distinct from score %, which reflects partial credit).

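| Category | Rank | Pass | Score % | Latency | tok/s | Cost |
|---|---|---|---|---|---|---|
| Coding | 26/29 | 33/60 | 64.8 | n/a | 108.2 | $0.0015 |
| UI | 25/29 | 1/9 | 21.9 | n/a | 115.1 | $0.0015 |
| Debugging | 21/29 | 18/60 | 67.2 | n/a | 111.6 | $0.0020 |
| Hallucination | 16/29 | 34/60 | 73.3 | n/a | 109.2 | $0.0014 |
| Reasoning | 6/29 | 14/60 | 73.4 | n/a | 108.8 | $0.0016 |
| Refactoring | 7/29 | 9/60 | 63.4 | n/a | 105.2 | $0.0027 |
| Security | 28/29 | 6/60 | 44.1 | n/a | 110.7 | $0.0006 |
| Speed | 17/29 | 55/60 | 97.2 | n/a | 114.9 | $0.0026 |
| Cost | 6/29 | 29/30 | 90.7 | n/a | 108.3 | $0.0004 |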
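The per-category pass counts above can be cross-checked against the headline figure of 199/459 for the v2 — Resilience run. The short script below only re-adds the numbers from the table; the dictionary layout is an illustrative choice, not part of BLXBench.

```python
# (passed, total) per category, copied from the category table.
category_pass = {
    "Coding": (33, 60),
    "UI": (1, 9),
    "Debugging": (18, 60),
    "Hallucination": (34, 60),
    "Reasoning": (14, 60),
    "Refactoring": (9, 60),
    "Security": (6, 60),
    "Speed": (55, 60),
    "Cost": (29, 30),
}

passed = sum(p for p, _ in category_pass.values())
total = sum(t for _, t in category_pass.values())
pass_rate = round(100 * passed / total, 2)

print(f"{passed}/{total} = {pass_rate}%")
```

The totals come out to 199/459, or 43.36%, which matches the Tests and pass-rate figures reported for this suite.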

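v2 — Resilience · Cost rank within suite · 1 run

| # | Run | Suite | By | Tests | Run Σ | This model |
|---|---|---|---|---|---|---|
| | run_dc7816 | v2 — Resilience | | 459 | $0.01 | $0.01 |

v1 — Nutrition · Cost rank within suite · 1 run

| # | Run | Suite | By | Tests | Run Σ | This model |
|---|---|---|---|---|---|---|
| | run_0e34ed | v1 — Nutrition | | 373 | $0.0041 | $0.0041 |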


Categories: 9 scopes

Cost per run: 2 for this model · 61 in overall_ranking · grouped by suite
