BLXBench Docs

Leaderboard

How to read and interpret the BLXBench leaderboard.

Overview

The BLXBench leaderboard displays benchmark results for AI models across multiple providers. Models are ranked by their aggregate performance across all test categories.

Understanding Scores

Overall Score

The leaderboard ranks models by an aggregate score computed from merged benchmark runs: total earned score divided by total max score across all tests in the rollup. This is a test-weighted percentage, not a fixed 25% weight per category, so categories with more fixtures influence the overall score more than sparse ones.
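The rollup described above can be sketched as follows. This is a minimal illustration of the math, not BLXBench's actual implementation; the test scores below are hypothetical:

```python
def aggregate(results):
    """Aggregate score: total earned / total max across all merged runs.

    results: list of (score, max_score) pairs, one per test.
    Category weight emerges from fixture count, not a fixed per-category share.
    """
    earned = sum(score for score, _ in results)
    possible = sum(max_score for _, max_score in results)
    return earned / possible if possible else 0.0

# Hypothetical runs: three "security" fixtures and one "speed" fixture,
# so security influences the overall score more than speed does.
runs = [(8, 10), (9, 10), (7, 10),   # security fixtures
        (10, 10)]                    # speed fixture
print(aggregate(runs))               # 34/40 -> 0.85
```

Note that speed scored a perfect 10/10 here, yet the overall lands at 0.85 because the larger security category dominates the rollup.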

Category Scores

Open a model to see per-category breakdowns (speed, security, reasoning, debugging, refactoring, hallucination, coding_ui, etc.). Each slice uses the same score / max_score rollup for tests in that category.

Costs

Use the Costs slice to compare estimated USD spend and related usage from submitted runs. This is separate from the quality score.

Filtering Results

By Provider (label)

The table derives a short provider label from the model id: the segment before the first /, e.g. openai in openai/gpt-5.4-mini. Search and filters use that string; it is not the same as the blxbench adapter aliases (opr, oai, …).
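The label derivation amounts to splitting the model id on the first slash. A minimal sketch (an assumption of how the table behaves, not BLXBench source code):

```python
def provider_label(model_id: str) -> str:
    """Derive the short provider label: the segment before the first '/'.

    Ids without a '/' pass through unchanged.
    """
    return model_id.split("/", 1)[0]

print(provider_label("openai/gpt-5.4-mini"))  # -> openai
```

Because the label is just a string prefix, filtering on it is independent of the adapter alias you pass to the blxbench CLI.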

By Search

Search matches substrings of the model name or provider label.

By Category or Difficulty

Narrow the table to a category (fixture domain) and/or difficulty (easy / medium / hard).

Viewing Individual Results

Click any model to see:

  • Full test results per category
  • Individual test cases
  • Historical runs
  • Cost analysis

Submitting Results

To submit your own results:

  1. Create a BLXBench account and complete a pass tier that includes leaderboard submission (see Account)
  2. For headless runs, create a BLXBench API key; for the TUI, sign in with /auth login
  3. Run benchmarks with blxbench and pass --submit or set BLXBENCH_SUBMIT=1

See Quick Start for blxbench examples.

Interpreting Results

What Makes a Good Score?

A good benchmark score means the model:

  • Completes tasks correctly
  • Does so within reasonable time
  • Handles edge cases well

Limitations

  • Benchmarks don't cover all use cases
  • Results vary by model version
  • Cost considerations are separate from performance

