BLXBench - Leaderboard

Overview

The BLXBench leaderboard displays benchmark results for AI models across multiple providers. Models are ranked by their aggregate performance across Overall-eligible BLXBench categories.

Understanding Scores

Overall Score

The leaderboard ranks models by an aggregate score derived from merged benchmark runs: total earned score divided by total max score across all Overall-eligible tests in that rollup. This is a test-weighted percentage, not a fixed 25% weight per category, so categories with more fixtures influence the Overall score more than sparse ones.

Category Scores

Open a model to see per-category breakdowns (speed, security, reasoning, debugging, refactoring, hallucination, coding_ui, roblox, etc.). Each slice uses the same score / max_score rollup for tests in that category.

roblox is a special category backed by Roblox OpenGameEval. It is visible in filters, run details, and breakdowns, but it is excluded from Overall score, Overall rank, trends, and best-run selection until Roblox exposes an API path that can evaluate every leaderboard model consistently.
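As a rough sketch of how the per-category slices and the roblox exclusion interact, the example below computes a breakdown for every category but leaves roblox out of the Overall figure. The tuple layout, the sample numbers, and the names `NON_OVERALL_CATEGORIES`, `category_breakdown`, and `overall_score` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict

# Assumption for this sketch: roblox is the only category excluded from Overall.
NON_OVERALL_CATEGORIES = {"roblox"}

# Hypothetical per-test results: (category, earned score, max score).
rows = [
    ("speed", 9, 10),
    ("security", 6, 10),
    ("roblox", 0, 20),
]


def category_breakdown(results):
    """Return {category: percentage} using score / max_score per category."""
    totals = defaultdict(lambda: [0, 0])
    for category, score, max_score in results:
        totals[category][0] += score
        totals[category][1] += max_score
    return {c: 100.0 * e / m for c, (e, m) in totals.items() if m}


def overall_score(results):
    """Test-weighted percentage over Overall-eligible categories only."""
    eligible = [r for r in results if r[0] not in NON_OVERALL_CATEGORIES]
    earned = sum(score for _, score, _ in eligible)
    possible = sum(max_score for _, _, max_score in eligible)
    return 100.0 * earned / possible if possible else 0.0


print(category_breakdown(rows))  # {'speed': 90.0, 'security': 60.0, 'roblox': 0.0}
print(overall_score(rows))       # 75.0
```

The roblox slice still appears in the breakdown, but it does not pull the Overall figure down.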

Costs

Use the Costs slice to compare estimated USD spend and related usage from submitted runs. This is separate from the quality score.

Filtering Results

Local vs cloud runs

Runs submitted from a local inference stack (LM Studio or Ollama) are identified throughout the leaderboard:

INFRA column — a compact column after COST shows a badge (e.g. LM Studio or Ollama) for every local-inference row. Cloud runs leave this column empty.
Row badge — the same label appears in small type below the Suite tag inside the model cell.
Inspector panel — when you select a local-inference model, the badge appears next to the SELECTED MODEL label and on the same line as the model slug.
Model / run detail pages — a Hardware Snapshot collapsible section shows the CPU, RAM, and GPU (with VRAM) of the machine that ran the benchmark.

Local runs carry provider_mode: local in their report.json and are produced by running blxbench with --provider lms or --provider oll. They are eligible for the public leaderboard under the same rules as cloud runs (full, unfiltered, no fail-fast).
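If you post-process your own reports, a check along these lines can tell local runs from cloud runs. This is only a sketch: it assumes provider_mode sits at the top level of report.json, and the mapping from the lms / oll aliases to badge text is inferred from the order in which they are mentioned above, not taken from the BLXBench source.

```python
import json

# Assumed mapping from CLI provider alias to badge text (lms -> LM Studio,
# oll -> Ollama); inferred from the documentation, not confirmed.
BADGES = {"lms": "LM Studio", "oll": "Ollama"}


def is_local_run(report):
    """True when the report declares provider_mode: local (assumed top-level)."""
    return report.get("provider_mode") == "local"


def infra_badge(report, provider_alias):
    """Badge text for local runs; empty string for cloud runs."""
    if not is_local_run(report):
        return ""
    return BADGES.get(provider_alias, "")


report = json.loads('{"provider_mode": "local"}')
print(infra_badge(report, "lms"))    # LM Studio
print(repr(infra_badge({}, "oai")))  # ''
```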

By Provider (label)

The table derives a short provider label from the model id (the segment before /, e.g. openai in openai/gpt-5.4-mini). Search and filters use that string — it is not the same as blxbench adapter aliases (opr, oai, …).
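The label derivation can be sketched in a few lines. The function name `provider_label` is hypothetical, and returning an empty string for ids without a slash is an assumption; the page does not say how such ids are handled.

```python
def provider_label(model_id):
    """Return the segment before the first '/' in a model id.

    Ids without a '/' yield an empty label (an assumption for this sketch).
    """
    label, separator, _ = model_id.partition("/")
    return label if separator else ""


print(provider_label("openai/gpt-5.4-mini"))  # openai
```

Note that the result is the id prefix (openai), not an adapter alias such as oai.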

By Search

Search matches model name or provider substring.

By Category or Difficulty

Narrow the table to a category (fixture domain) and/or difficulty (easy / medium / hard).

Where the UI supports it, rows may include a pointer to each model’s best public run so you can open the underlying run detail for context.

Viewing Individual Results

Click any model to see:

Full test results per category
Individual test cases
Historical runs
Cost analysis

Submitting Results

To submit your own results:

Create a BLXBench account and complete a pass tier that includes leaderboard submission (see Account)
For headless runs, create a BLXBench API key; for the TUI, sign in with /auth login
Run benchmarks with blxbench and pass --submit or set BLXBENCH_SUBMIT=1

See Quick Start for blxbench examples.

Public submission rules

The public POST /api/bench/submit endpoint (used by --submit, BLXBENCH_SUBMIT, and TUI /report submit / s / r) accepts only reports that represent a fair, comparable run on the shared suite. In practice, your report.json is rejected with a short message (HTTP 400 unless the table says otherwise) when any of these apply:

Condition	Why
`limited_run` (`--limit` / `/set limit`)	Per-category caps change how much of the suite ran — not comparable to full runs.
Category, level, or per-category limit filters	The cli.options embedded in the report show that the suite was narrowed. A full `*` selection for categories/levels in the TUI is OK.
`exit_early` / fail-fast	The run stopped before the full plan finished.
`executed_tests` ≠ `total_tests`	Incomplete or legacy partial reports.
Duplicate `run_id`	That run was already submitted (HTTP 409).

Filtering is still useful for local experiments, faster feedback, and /report list review; only public upload requires an unfiltered, completed run. A full run that also includes roblox is still eligible, because the uploaded Overall rollup ignores roblox. When in doubt, check /show in the TUI (no category/level lines means all Overall categories and no limit) and run without fail-fast for a ranked submission.
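Before uploading, you could run a local pre-check that mirrors the table above. The sketch below is an approximation: it assumes limited_run, exit_early, executed_tests, and total_tests are top-level keys of report.json, and the option keys `categories` and `levels` under cli.options are hypothetical names. The server remains the authority, and a duplicate run_id cannot be detected locally.

```python
def submission_problems(report):
    """Return reasons a report would likely be rejected (best-effort guess)."""
    problems = []
    if report.get("limited_run"):
        problems.append("limited_run is set (--limit or /set limit)")
    if report.get("exit_early"):
        problems.append("exit_early is set (fail-fast run)")
    if report.get("executed_tests") != report.get("total_tests"):
        problems.append("executed_tests does not match total_tests")
    # Hypothetical option names; a missing value or "*" is treated as unfiltered.
    options = report.get("cli", {}).get("options", {})
    for key in ("categories", "levels"):
        value = options.get(key)
        if value not in (None, [], "*", ["*"]):
            problems.append(f"cli.options.{key} narrows the suite")
    return problems


full_run = {
    "limited_run": False,
    "exit_early": False,
    "executed_tests": 120,
    "total_tests": 120,
    "cli": {"options": {"categories": "*"}},
}
partial_run = {
    "limited_run": True,
    "executed_tests": 30,
    "total_tests": 120,
    "cli": {"options": {"categories": ["security"]}},
}

print(submission_problems(full_run))          # []
print(len(submission_problems(partial_run)))  # 3
```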

Weekly submit quota (per model)

Scout, Bencher, and Founder passes cap how many public bench uploads you can make per model id in each ISO week (UTC): Scout includes 2, Bencher 5, and Founder 10 submissions per model per week (see /pass for tier limits). Each accepted report.json increments the count for every model listed in report.summary.models. If a model is at its cap, POST /api/bench/submit returns 429 with code: "BENCH_WEEKLY_LIMIT" and a message naming the model; the response may also include model, used, and limit for scripting.
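For scripting around the quota, the sketch below tracks submissions per ISO week in UTC and recognizes the limit response. It assumes report.summary.models is a list of model id strings and that you keep your own log of accepted uploads; the helper names are hypothetical, and the server's count is what actually applies.

```python
from collections import Counter
from datetime import datetime, timezone

# Tier limits as stated above (submissions per model per ISO week).
TIER_LIMITS = {"Scout": 2, "Bencher": 5, "Founder": 10}


def iso_week_key(moment):
    """ISO year and week of a timezone-aware datetime, evaluated in UTC."""
    iso = moment.astimezone(timezone.utc).isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def remaining_submissions(tier, accepted, model_id, now):
    """Submissions left for model_id in the ISO week containing `now`.

    `accepted` is a locally kept list of (submitted_at, report) pairs.
    """
    week = iso_week_key(now)
    used = Counter()
    for submitted_at, report in accepted:
        if iso_week_key(submitted_at) != week:
            continue
        # Every model listed in the report counts once per accepted upload.
        for model in report["summary"]["models"]:
            used[model] += 1
    return max(0, TIER_LIMITS[tier] - used[model_id])


def is_weekly_limit(status, body):
    """True for the 429 response carrying code BENCH_WEEKLY_LIMIT."""
    return status == 429 and body.get("code") == "BENCH_WEEKLY_LIMIT"


utc = timezone.utc
report = {"summary": {"models": ["openai/gpt-5.4-mini"]}}
accepted = [
    (datetime(2024, 1, 1, 9, 0, tzinfo=utc), report),     # ISO week 2024-W01
    (datetime(2023, 12, 31, 23, 0, tzinfo=utc), report),  # ISO week 2023-W52
]
now = datetime(2024, 1, 3, 12, 0, tzinfo=utc)

print(iso_week_key(now))                                                      # 2024-W01
print(remaining_submissions("Scout", accepted, "openai/gpt-5.4-mini", now))   # 1
```

Only the upload from the same ISO week counts against the cap, so one of the two Scout submissions remains in this example.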

Hosting policy (operator): some production sites also enforce a manifest allowlist or report integrity verification (cryptographic signing). If you see errors about manifest hash or signature / integrity, your CLI build and environment must match what that deployment documents — those checks are not something you fix by editing report.json by hand.

Interpreting Results

What Makes a Good Score?

A good benchmark score means the model:

Completes tasks correctly
Does so within reasonable time
Handles edge cases well

Limitations

Benchmarks don't cover all use cases
Results vary by model version
Cost considerations are separate from performance
