BLXBench Docs

Our Tests

Explore the BLXBench test catalog and understand test categories.

Test Catalog

BLXBench evaluates models against a fixed, versioned set of JSON fixtures. In the repo, fixtures live under:

  • v1: packages/benchmark-core/tests/
  • v2: packages/benchmark-core/suites/v2/tests/

The web test catalog mirrors these trees and lets you switch suite versions from the Suite version filter.
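
As a rough illustration of how the two fixture trees could be walked locally, the sketch below groups fixture files by category folder. It assumes each category is a direct subfolder of the suite root and each fixture is a single .json file inside it; `list_fixtures` and `SUITE_ROOTS` are hypothetical names for this example and are not part of blxbench.

```python
from collections import defaultdict
from pathlib import Path

# Suite roots as documented above, relative to a repository checkout.
SUITE_ROOTS = {
    "v1": Path("packages/benchmark-core/tests"),
    "v2": Path("packages/benchmark-core/suites/v2/tests"),
}


def list_fixtures(repo_root, suite="v1"):
    """Return {category: [fixture file names]} for one suite version.

    Assumption: <suite root>/<category>/<fixture>.json is the layout.
    """
    root = Path(repo_root) / SUITE_ROOTS[suite]
    by_category = defaultdict(list)
    # One level of folders (categories), then the JSON fixtures inside them.
    for path in sorted(root.glob("*/*.json")):
        by_category[path.parent.name].append(path.name)
    return dict(by_category)
```

Calling `list_fixtures(".", "v2")` from a repository checkout would return one entry per category folder found under the v2 tree.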

The optional roblox category vendors Roblox OpenGameEval fixtures into the same catalog, but executes them through Roblox’s hosted OpenCloud evaluation API.

Categories

Fixture folders (categories) in the suite include:

Category       Focus
speed          Latency-sensitive tasks where concise, correct output matters
security       Safer code changes, vulnerability awareness, refusal behavior
reasoning      Arithmetic, structured steps, logical problems
debugging      Minimal patches and edge-heavy bug fixes
refactoring    Rewrites that must preserve behavior
hallucination  Grounded answers when context is missing or adversarial
coding         Executable JavaScript function tasks scored by hidden test cases
ui             Single-file HTML/UI artifacts with Playwright render and judge validation
coding_ui      HTML (and similar) artifacts; optional Playwright render stage
roblox         Roblox OpenGameEval Lua/game tasks; special category, excluded from Overall

roblox is opt-in. Default all-category runs skip it; runs that request it explicitly appear in reports and web breakdowns. Roblox results do not affect the Overall score, rank, trends, or best-run logic, because Roblox's hosted eval API currently supports only openai, claude, and gemini.
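
The opt-in rule can be pictured with a small sketch. The category names come from the table above; the function names are hypothetical, and the plain mean used for the Overall score is an assumption made only for illustration, since this page does not say how Overall is aggregated.

```python
# Categories as listed in the table above.
ALL_CATEGORIES = [
    "speed", "security", "reasoning", "debugging", "refactoring",
    "hallucination", "coding", "ui", "coding_ui", "roblox",
]
# Categories that run only when requested and never count toward Overall.
OPT_IN = {"roblox"}


def categories_for_run(requested=None):
    """Default runs skip opt-in categories; explicit requests are honored."""
    if requested is None:
        return [c for c in ALL_CATEGORIES if c not in OPT_IN]
    unknown = [c for c in requested if c not in ALL_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return list(requested)


def overall_score(category_scores):
    """Mean of per-category scores, ignoring opt-in categories.

    Assumption: a plain mean stands in for the real aggregation.
    """
    counted = [s for c, s in category_scores.items() if c not in OPT_IN]
    return sum(counted) / len(counted) if counted else None
```

With these definitions, a roblox score can be present in a run's results without changing the value returned by `overall_score`.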

Running roblox does not require local Python, uv, or Roblox Studio. It requires a Roblox account and an OpenCloud API key with the studio-evaluations:create permission. Configure that key as OPEN_GAME_EVAL_API_KEY through .env, shell exports, or the env entry in ~/.blxbench/config.json.
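
To check locally whether the key is visible before starting a Roblox run, something like the following could be used. It is a minimal sketch: it assumes the process environment (shell exports, or a .env file already loaded into the environment) takes precedence over the env object in ~/.blxbench/config.json, which this page does not state, and `resolve_roblox_key` is a hypothetical helper rather than a blxbench function.

```python
import json
import os
from pathlib import Path

KEY_NAME = "OPEN_GAME_EVAL_API_KEY"


def resolve_roblox_key(environ=None, config_path=None):
    """Return the configured key, or None if it cannot be found.

    Assumed lookup order: process environment first, then the "env"
    object in ~/.blxbench/config.json.
    """
    environ = os.environ if environ is None else environ
    if environ.get(KEY_NAME):
        return environ[KEY_NAME]

    path = Path(config_path) if config_path else Path.home() / ".blxbench" / "config.json"
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or unreadable config file: treat as "not configured".
        return None

    env = config.get("env") if isinstance(config, dict) else None
    if isinstance(env, dict):
        return env.get(KEY_NAME) or None
    return None
```

A return value of None means neither source provided the key.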

Difficulty Levels

Each fixture has a difficulty level:

Level   Description
Easy    Lighter prompts / scoring
Medium  Representative difficulty
Hard    Stricter scorers or longer tasks

Viewing Tests

Browse cases in the test catalog. Detail pages show id, prompts (where exposed), category, level, scorer, and historical pass rates when data exists.

Test Format

Fixtures are JSON with fields such as:

  • id / file name — Stable identifier
  • prompt — User input
  • category — Folder / domain
  • level — easy | medium | hard (legacy aliases normalized at runtime)
  • Scorer configuration — How passes are judged (exact match, rubric, hidden executable tests for coding, render+judge for ui / coding_ui, etc.)

Exact schema varies by benchmark type; open a fixture in the repo for the authoritative shape.
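
For orientation only, here is a sketch of a basic sanity check on a fixture with the fields listed above. The example fixture, the `scorer` key and its contents, and the `check_fixture` helper are all hypothetical. The sketch also assumes id is stored as a field (it may instead come from the file name), and it only lowercases the level rather than applying the legacy alias mapping, which is not documented here.

```python
import json

# Assumption: these four fields are present in every fixture.
REQUIRED_FIELDS = ("id", "prompt", "category", "level")
LEVELS = ("easy", "medium", "hard")

# Hypothetical fixture, written for this example; not taken from the repo.
EXAMPLE_FIXTURE = """
{
  "id": "reasoning-sum-001",
  "prompt": "What is 17 + 25? Answer with the number only.",
  "category": "reasoning",
  "level": "Easy",
  "scorer": {"type": "exact_match", "expected": "42"}
}
"""


def check_fixture(raw_json):
    """Parse a fixture and verify the documented fields are usable."""
    fixture = json.loads(raw_json)
    missing = [f for f in REQUIRED_FIELDS if f not in fixture]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Only case and whitespace are normalized here; real alias handling may differ.
    level = str(fixture["level"]).strip().lower()
    if level not in LEVELS:
        raise ValueError(f"unknown level: {fixture['level']!r}")
    return {**fixture, "level": level}
```

`check_fixture(EXAMPLE_FIXTURE)` returns the parsed fixture with its level normalized to lowercase; a fixture in the repo remains the authoritative reference for the real shape.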

Contributing Tests

A public “submit a fixture” workflow is not available yet. Today, changes go through the main repository: add JSON under packages/benchmark-core/tests/<category>/ and open a pull request. Tests should be deterministic, safe to run unattended, and cheap enough for community runners where possible.
