Updated Jun 12, 08:53 PM · 29 models / 40 · 490 fixtures

AI Model Benchmark Leaderboard

Category-aware model rankings from local BLXBench runs, grouped by task domain, difficulty level, pass rate, and latency.

| Run | By | When | Tests | Cost |
|---|---|---|---|---|
| run_f78b01 (v2 — Resilience) | B | Jun 09, 06:36 PM | 459 | $18.57 |
| run_f9390d (v2 — Resilience) | B | Jun 04, 03:59 PM | 459 | $0.35 |
| run_510052 (v2 — Resilience) | B | Jun 01, 01:04 AM | 459 | $0.37 |

3 of 61 runs shown.

Top score: Gpt 5.5, 77.9
Executed tests: 13293 (490 available fixtures)
Est. API spend: $64.98 (sum of per-model costs from overall_ranking)
Top decode: Nemotron 3 Super 120b A12b, 6682.6 tok/s
Categories (9): Coding / Ui / Debugging / Hallucination / Reasoning / Refactoring / Security / Speed / Cost
Levels (3): easy / medium / hard

Benchmark

View: Overall · All levels · Suite: v2 — Resilience

| Rank | Model | Model ID | Pass | Score | Latency | tok/s | Cost | Infra |
|---|---|---|---|---|---|---|---|---|
| 1 | Gpt 5.5 | openai/gpt-5.5 | 284/459 | 77.9 | 6.06s | 102.7 | $6.78 | |
| 2 | Gpt 5.3 Codex | openai/gpt-5.3-codex | 281/459 | 77.7 | 5.24s | 107.2 | $2.63 | |
| 3 | Qwen3.7 Max | qwen/qwen3.7-max | 252/459 | 75.3 | 4.58s | 223.3 | $1.88 | |
| 4 | Kimi K2.6 | moonshotai/kimi-k2.6 | 278/459 | 74.8 | 5.82s | 130.2 | $1.10 | |
| 5 | Minimax M3 | minimax/minimax-m3 | 266/459 | 74.7 | 16.80s | 46.1 | $0.37 | |
| 6 | Deepseek V4 Pro | deepseek/deepseek-v4-pro | 242/459 | 73.8 | 13.50s | 48.9 | $0.82 | |
| 7 | Mimo V2.5 | xiaomi/mimo-v2.5 | 254/459 | 73.8 | 5.57s | 120.7 | $0.45 | |
| 8 | Mimo V2.5 Pro | xiaomi/mimo-v2.5-pro | 248/459 | 73.1 | 7.42s | 77.9 | $0.80 | |
| 9 | Grok Build 0.1 | x-ai/grok-build-0.1 | 234/459 | 72.8 | 12.09s | 190.9 | $2.01 | Mandatory thinking |
| 10 | Nemotron 3 Super 120b A12b | nvidia/nemotron-3-super-120b-a12b:free | 222/459 | 72.3 | 11.80s | 6682.6 | $0.00 | |
| 11 | Glm 5.1 | z-ai/glm-5.1 | 235/459 | 72.1 | 3.38s | 163.7 | $1.03 | |
| 12 | Deepseek V4 Flash | deepseek/deepseek-v4-flash | 226/459 | 71.2 | 9.98s | 52.4 | $0.06 | |
| 13 | Mistral Medium 3 5 | mistralai/mistral-medium-3-5 | 221/459 | 70.3 | 2.78s | 166.0 | $1.39 | |
| 14 | Qwen3.7 Plus | qwen/qwen3.7-plus | 216/459 | 69.9 | 9.44s | 55.1 | $0.35 | |
| 15 | Claude Opus 4.8 | anthropic/claude-opus-4.8 | 276/459 | 69.7 | 10.18s | 187.1 | $9.39 | |
| 16 | Qwen3.6 Flash | qwen/qwen3.6-flash | 210/459 | 69.3 | 3.59s | 204.0 | $0.43 | |
| 17 | Cobuddy | baidu/cobuddy:free | 216/459 | 68.7 | 21.58s | 50.6 | $0.00 | Mandatory thinking |
| 18 | Claude Opus 4.7 | anthropic/claude-opus-4.7 | 276/456 | 68.1 | 10.01s | 95.1 | $8.89 | |
| 19 | Nemotron 3 Nano 30b A3b | nvidia/nemotron-3-nano-30b-a3b:free | 187/459 | 68.0 | 1.99s | 250.1 | $0.00 | |
| 20 | Mistral Small 2603 | mistralai/mistral-small-2603 | 207/458 | 67.8 | 2.48s | 180.2 | $0.10 | |
| 21 | Grok 4.3 | x-ai/grok-4.3 | 214/459 | 67.6 | 5.35s | 101.0 | $0.47 | |
| 22 | Granite 4.1 8b | ibm-granite/granite-4.1-8b | 199/459 | 67.1 | 2.89s | 109.8 | $0.01 | |
| 23 | Ring 2.6 1t | inclusionai/ring-2.6-1t:free | 199/445 | 65.3 | 9.87s | 105.9 | $0.00 | Mandatory thinking |
| 24 | Claude Fable 5 | anthropic/claude-fable-5 | 259/459 | 64.5 | 11.50s | 187.1 | $18.57 | Mandatory thinking |
| 25 | Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite | 201/459 | 64.2 | 1.72s | 504.7 | $0.23 | |
| 26 | Nemotron 3 Nano Omni 30b A3b Reasoning | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 167/459 | 57.9 | 5.00s | 227.6 | $0.00 | |
| 27 | Step 3.7 Flash | stepfun/step-3.7-flash | 175/459 | 55.2 | 6.14s | 257.3 | $0.74 | Mandatory thinking |
| 28 | Minimax M2.7 | minimax/minimax-m2.7 | 148/459 | 51.0 | 15.91s | 211.8 | $1.04 | Mandatory thinking |
| 29 | Gemini 3.5 Flash | google/gemini-3.5-flash | 108/459 | 41.5 | 6.66s | 338.3 | $5.43 | Mandatory thinking |
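
The summary figures (13293 executed tests, $64.98 estimated API spend) can be cross-checked against the leaderboard rows. The sketch below is a minimal check, not BLXBench code: `ROWS` is a hypothetical name, and its values are copied by hand from the Pass and Cost columns above. It assumes "executed tests" is the sum of the Pass denominators and that spend is the sum of the displayed per-model costs.

```python
from decimal import Decimal

# (passed, attempted, displayed cost in USD) for ranks 1..29,
# copied from the leaderboard rows; ROWS is an illustrative name.
ROWS = [
    (284, 459, "6.78"), (281, 459, "2.63"), (252, 459, "1.88"),
    (278, 459, "1.10"), (266, 459, "0.37"), (242, 459, "0.82"),
    (254, 459, "0.45"), (248, 459, "0.80"), (234, 459, "2.01"),
    (222, 459, "0.00"), (235, 459, "1.03"), (226, 459, "0.06"),
    (221, 459, "1.39"), (216, 459, "0.35"), (276, 459, "9.39"),
    (210, 459, "0.43"), (216, 459, "0.00"), (276, 456, "8.89"),
    (187, 459, "0.00"), (207, 458, "0.10"), (214, 459, "0.47"),
    (199, 459, "0.01"), (199, 445, "0.00"), (259, 459, "18.57"),
    (201, 459, "0.23"), (167, 459, "0.00"), (175, 459, "0.74"),
    (148, 459, "1.04"), (108, 459, "5.43"),
]

# Sum of the Pass denominators across all models.
executed_tests = sum(attempted for _, attempted, _ in ROWS)

# Decimal avoids binary floating-point drift when adding currency values.
total_cost = sum((Decimal(cost) for _, _, cost in ROWS), Decimal("0"))

print(executed_tests)  # 13293
print(total_cost)      # 64.97
```

The test count matches the page exactly. The cost comes out one cent below the reported $64.98, which is consistent with the page summing unrounded per-model costs before display, although the page does not confirm this.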
Selected model

Gpt 5.5 (openai/gpt-5.5), rank 1

Score: 77.9
Pass rate: 61.9%
Tests: 284/459 (v2 — Resilience)
Avg latency: 6.06s
TTFT: 1091 ms
Decode: 102.7 tok/s
Slice cost: $6.78
Runs: 1
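
The pass rate shown for the selected model is consistent with passed tests divided by attempted tests, expressed as a percentage. The helper below is a minimal sketch of that arithmetic; `pass_rate` is a hypothetical name, not part of BLXBench. Note that the Score (77.9) differs from the pass rate (61.9), and the page does not state how Score is derived, so no score formula is attempted here.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of passed tests, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


# Gpt 5.5 passed 284 of 459 tests in the v2 suite.
print(pass_rate(284, 459))  # 61.9
```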
Coding: 98.5
Ui: 87.5
Debugging: 77.0
Hallucination: 78.9
Reasoning: 76.2
Refactoring: 65.2
Security: 74.8
Speed: 89.4
Cost: 91.3
Run context

The metrics shown are from the best public run (highest score %) for this model.

Snapshot: May 10, 11:46 PM. Best-run suite: v2 — Resilience.

Starred rows in Category profile are optional benchmark slices (opt-in, e.g. Roblox). An n/a there means the best public run did not include that category; it is not a scored zero.
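
The distinction between n/a and a scored zero matters whenever category values are aggregated. The sketch below illustrates it under stated assumptions: `mean_scored` and `profile` are hypothetical names, the nine scores are the Gpt 5.5 values listed above, the `Roblox` entry stands in for an opt-in slice the run did not include, and the plain mean is used only for illustration. It is not the BLXBench score formula.

```python
from typing import Optional


def mean_scored(profile: dict[str, Optional[float]]) -> float:
    """Average only the categories that were actually scored (skip None)."""
    scored = [value for value in profile.values() if value is not None]
    if not scored:
        raise ValueError("no scored categories")
    return sum(scored) / len(scored)


profile = {
    "Coding": 98.5, "Ui": 87.5, "Debugging": 77.0,
    "Hallucination": 78.9, "Reasoning": 76.2, "Refactoring": 65.2,
    "Security": 74.8, "Speed": 89.4, "Cost": 91.3,
    "Roblox": None,  # hypothetical opt-in slice, not included in this run
}

# Treating n/a as missing leaves the average over nine categories.
print(round(mean_scored(profile), 1))  # 82.1

# Treating n/a as zero would wrongly pull the average down.
zero_filled = {name: (value or 0.0) for name, value in profile.items()}
print(round(mean_scored(zero_filled), 1))  # 73.9
```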


BLXBench

Community-driven leaderboard
Public benchmark runner: run in your environment, share results with the community.

© 2026 BLXBench by bitslix.com

Provenance: Aggregated from user runs
Scope: 40 / 11 / 490
Latest: run_f78b01 / 459 / $18.57