Leaderboard
How to read and interpret the BLXBench leaderboard.
Overview
The BLXBench leaderboard displays benchmark results for AI models across multiple providers. Models are ranked by their aggregate performance across Overall-eligible BLXBench categories.
Understanding Scores
Overall Score
The leaderboard ranks models by an aggregate score derived from merged benchmark runs: total earned score divided by total max score across all Overall-eligible tests in that rollup (i.e. a test-weighted percentage, not a fixed 25% weight per category). Categories with more fixtures therefore influence the overall more than sparse ones.
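The difference between a test-weighted percentage and a per-category average can be illustrated with a minimal sketch. The record layout (a list of dicts with `category`, `score`, and `max_score` keys) and the sample data are assumptions made for illustration, not the actual BLXBench report schema.

```python
from collections import defaultdict


def overall_score(tests):
    """Test-weighted aggregate: total earned score over total max score."""
    earned = sum(t["score"] for t in tests)
    possible = sum(t["max_score"] for t in tests)
    return earned / possible if possible else 0.0


def category_average(tests):
    """Equal weight per category. Shown only for contrast; not the leaderboard rule."""
    by_category = defaultdict(list)
    for t in tests:
        by_category[t["category"]].append(t)
    ratios = [overall_score(group) for group in by_category.values()]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical data: 8 reasoning tests (6 passed) and 2 speed tests (both passed).
tests = (
    [{"category": "reasoning", "score": 1, "max_score": 1}] * 6
    + [{"category": "reasoning", "score": 0, "max_score": 1}] * 2
    + [{"category": "speed", "score": 1, "max_score": 1}] * 2
)

print(overall_score(tests))     # 0.8   (8 earned / 10 possible)
print(category_average(tests))  # 0.875 (mean of 0.75 and 1.0)
```

In this made-up example the larger reasoning category pulls the test-weighted result toward its own pass rate, which is the effect described above.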
Category Scores
Open a model to see per-category breakdowns (speed, security, reasoning, debugging, refactoring, hallucination, coding_ui, roblox, etc.). Each slice uses the same score / max_score rollup for tests in that category.
roblox is a special category backed by Roblox OpenGameEval. It is visible in filters, run details, and breakdowns, but it is excluded from Overall score, Overall rank, trends, and best-run selection until Roblox exposes an API path that can evaluate every leaderboard model consistently.
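As a rough sketch of how a per-category breakdown can coexist with an Overall score that skips roblox, consider the following. The dict keys and the `NON_OVERALL_CATEGORIES` set are hypothetical names chosen for this example; the real rollup code may be organized differently.

```python
from collections import defaultdict

# Assumption: roblox is the only category excluded from Overall, as stated above.
NON_OVERALL_CATEGORIES = {"roblox"}


def category_breakdown(tests):
    """Per-category score / max_score ratio, including special categories."""
    totals = defaultdict(lambda: [0.0, 0.0])  # category -> [earned, possible]
    for t in tests:
        totals[t["category"]][0] += t["score"]
        totals[t["category"]][1] += t["max_score"]
    return {c: (e / m if m else 0.0) for c, (e, m) in totals.items()}


def overall_excluding_special(tests):
    """Overall ratio computed only from Overall-eligible categories."""
    eligible = [t for t in tests if t["category"] not in NON_OVERALL_CATEGORIES]
    earned = sum(t["score"] for t in eligible)
    possible = sum(t["max_score"] for t in eligible)
    return earned / possible if possible else 0.0


# Hypothetical run: security 3 of 4 points, roblox 0 of 2 points.
run = [
    {"category": "security", "score": 3, "max_score": 4},
    {"category": "roblox", "score": 0, "max_score": 2},
]

print(category_breakdown(run))         # {'security': 0.75, 'roblox': 0.0}
print(overall_excluding_special(run))  # 0.75
```

The roblox slice still appears in the breakdown, but it does not move the Overall figure.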
Costs
Use the Costs slice to compare estimated USD spend and related usage from submitted runs. This is separate from the quality score.
Filtering Results
Local vs cloud runs
Runs submitted from a local inference stack (LM Studio or Ollama) are identified throughout the leaderboard:
- INFRA column: a compact column after COST shows a badge (e.g. LM Studio or Ollama) for every local-inference row. Cloud runs leave this column empty.
- Row badge: the same label appears in small type below the Suite tag inside the model cell.
- Inspector panel: when you select a local-inference model, the badge appears next to the SELECTED MODEL label and on the same line as the model slug.
- Model / run detail pages: a Hardware Snapshot collapsible section shows the CPU, RAM, and GPU (with VRAM) of the machine that ran the benchmark.
Local runs carry provider_mode: local in their report.json and are produced by using --provider lms or --provider oll. They are eligible for the public leaderboard under the same rules as cloud runs (full, unfiltered, no fail-fast).
By Provider (label)
The table derives a short provider label from the model id (the segment before /, e.g. openai in openai/gpt-5.4-mini). Search and filters use that string — it is not the same as blxbench adapter aliases (opr, oai, …).
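A minimal sketch of that derivation, assuming the label is simply everything before the first `/`. The function name is hypothetical, and returning an empty string for ids without a slash is an assumption; the source does not say how such ids are handled.

```python
def provider_label(model_id: str) -> str:
    """Return the segment before the first '/' in a model id."""
    head, sep, _ = model_id.partition("/")
    # Assumption: ids without a '/' have no provider label.
    return head if sep else ""


print(provider_label("openai/gpt-5.4-mini"))  # openai
```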
By Search
Search matches model name or provider substring.
By Category or Difficulty
Narrow the table to a category (fixture domain) and/or difficulty (easy / medium / hard).
Where the UI supports it, rows may include a pointer to each model’s best public run so you can open the underlying run detail for context.
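The search and the category/difficulty filters in this section can be approximated as below. This is only an illustration: the row and test dict shapes are hypothetical, and case-insensitive matching is an assumption rather than documented behavior.

```python
def matches_search(row, query):
    """True if the query is a substring of the model name or provider label."""
    q = query.strip().lower()  # assumption: matching ignores case
    return q in row["name"].lower() or q in row["provider"].lower()


def filter_tests(tests, category=None, difficulty=None):
    """Keep tests matching the given category and/or difficulty (None = no filter)."""
    return [
        t
        for t in tests
        if (category is None or t["category"] == category)
        and (difficulty is None or t["difficulty"] == difficulty)
    ]


# Hypothetical rows and tests.
rows = [
    {"name": "gpt-5.4-mini", "provider": "openai"},
    {"name": "example-model", "provider": "example"},
]
tests = [
    {"category": "security", "difficulty": "hard"},
    {"category": "security", "difficulty": "easy"},
    {"category": "speed", "difficulty": "hard"},
]

print([r["name"] for r in rows if matches_search(r, "OpenAI")])  # ['gpt-5.4-mini']
print(len(filter_tests(tests, category="security", difficulty="hard")))  # 1
```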
Viewing Individual Results
Click any model to see:
- Full test results per category
- Individual test cases
- Historical runs
- Cost analysis
Submitting Results
To submit your own results:
- Create a BLXBench account and complete a pass tier that includes leaderboard submission (see Account)
- For headless runs, create a BLXBench API key; for the TUI, sign in with /auth login
- Run benchmarks with blxbench and pass --submit or set BLXBENCH_SUBMIT=1
See Quick Start for blxbench examples.
Public submission rules
The public POST /api/bench/submit endpoint (used by --submit, BLXBENCH_SUBMIT, and TUI /report submit / s / r) accepts only reports that represent a fair, comparable run on the shared suite. In practice, your report.json is rejected (HTTP 400 with a short message) when any of these apply:
| Condition | Why |
|---|---|
| limited_run (--limit / /set limit) | Per-category caps change how much of the suite ran, so the result is not comparable to full runs. |
| Category, level, or per-category limit filters | The cli.options embedded in the report show that the suite was narrowed. Full * for categories/levels in the TUI is OK. |
| exit_early / fail-fast | The run stopped before the full plan finished. |
| executed_tests ≠ total_tests | Incomplete or legacy partial reports. |
| Duplicate run_id | That run was already submitted (HTTP 409). |
Filtering is still useful for local experiments, faster feedback, and /report list review — it is only public upload that requires an unfiltered, completed run. A full run that also includes roblox is still eligible; the uploaded Overall fold ignores roblox. When in doubt, check /show in the TUI (no category/level lines means all Overall categories, no limit) and run without fail-fast for a ranked submission.
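A local pre-flight check along these lines can catch some of the rejections before uploading. It is a sketch under stated assumptions: the field names `limited_run`, `exit_early`, `executed_tests`, and `total_tests` come from the rules above, but reading them as flat top-level keys of the report is a guess about the schema. It does not inspect cli.options filters or duplicate run ids, and the server remains the authority.

```python
def submission_blockers(report):
    """Return reasons a report would likely be rejected; an empty list means none found."""
    reasons = []
    # Assumption: these flags are top-level keys in the parsed report.json.
    if report.get("limited_run"):
        reasons.append("limited_run: per-category caps were applied")
    if report.get("exit_early"):
        reasons.append("exit_early: the run stopped before the full plan finished")
    if report.get("executed_tests") != report.get("total_tests"):
        reasons.append("executed_tests does not equal total_tests")
    return reasons


# Hypothetical reports.
full_run = {"limited_run": False, "exit_early": False,
            "executed_tests": 120, "total_tests": 120}
partial_run = {"limited_run": True, "exit_early": False,
               "executed_tests": 40, "total_tests": 120}

print(submission_blockers(full_run))          # []
print(len(submission_blockers(partial_run)))  # 2
```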
Weekly submit quota (per model)
Scout, Bencher, and Founder passes cap the number of public bench uploads you can make per model id in each ISO week (UTC): Scout allows 2, Bencher 5, and Founder 10 submissions per model per week (see /pass for tier limits). Each accepted report.json increments the count for every model listed in report.summary.models. If a model is at its cap, POST /api/bench/submit returns HTTP 429 with code: "BENCH_WEEKLY_LIMIT" and a message naming the model; the response may also include model, used, and limit for scripting.
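For scripting around the quota, the sketch below tracks ISO weeks in UTC on the client side and recognizes the limit response. The tier numbers and the `code`, `model`, `used`, and `limit` fields come from the text above; the function names, the lowercase tier keys, and the local bookkeeping of accepted uploads are assumptions, and the server's count is what actually applies.

```python
from datetime import datetime, timezone

# Per-model weekly limits as stated above; the lowercase keys are an assumption.
TIER_LIMITS = {"scout": 2, "bencher": 5, "founder": 10}


def iso_week_key(moment):
    """ISO year and week of a timezone-aware datetime, evaluated in UTC."""
    iso = moment.astimezone(timezone.utc).isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def remaining_submissions(tier, model_id, accepted, now):
    """accepted: list of (model_id, aware datetime) pairs for accepted reports."""
    week = iso_week_key(now)
    used = sum(1 for m, ts in accepted if m == model_id and iso_week_key(ts) == week)
    return max(TIER_LIMITS[tier] - used, 0)


def weekly_limit_info(status, body):
    """Return the optional quota fields if the response is a weekly-limit rejection."""
    if status == 429 and body.get("code") == "BENCH_WEEKLY_LIMIT":
        return {k: body[k] for k in ("model", "used", "limit") if k in body}
    return None


# Hypothetical history: one upload on each side of an ISO week boundary.
history = [
    ("openai/gpt-5.4-mini", datetime(2024, 12, 29, 23, 0, tzinfo=timezone.utc)),
    ("openai/gpt-5.4-mini", datetime(2024, 12, 30, 1, 0, tzinfo=timezone.utc)),
]
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)

print(iso_week_key(now))  # 2025-W01
print(remaining_submissions("scout", "openai/gpt-5.4-mini", history, now))  # 1
```

ISO weeks start on Monday, so the upload on Sunday 29 December 2024 belongs to the previous week and does not count against the week containing 1 January 2025.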
Hosting policy (operator): some production sites also enforce a manifest allowlist or report integrity verification (cryptographic signing). If you see errors about manifest hash or signature / integrity, your CLI build and environment must match what that deployment documents — those checks are not something you fix by editing report.json by hand.
Interpreting Results
What Makes a Good Score?
A good benchmark score means the model:
- Completes tasks correctly
- Does so within reasonable time
- Handles edge cases well
Limitations
- Benchmarks don't cover all use cases
- Results vary by model version
- Cost considerations are separate from performance