BLXBench - Our Tests

Test Catalog

BLXBench evaluates models against a fixed, versioned set of JSON fixtures. In the repo, fixtures live under:

v1: packages/benchmark-core/tests/
v2: packages/benchmark-core/suites/v2/tests/
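
As a rough illustration of how these trees can be walked, the sketch below groups fixture files by category folder for a chosen suite version. It is not part of BLXBench: the `list_fixtures` helper is hypothetical, and it assumes each suite root holds one sub-folder per category containing `*.json` files.

```python
from pathlib import Path

# Suite roots as listed above, relative to the repository root.
SUITE_ROOTS = {
    "v1": Path("packages/benchmark-core/tests"),
    "v2": Path("packages/benchmark-core/suites/v2/tests"),
}


def list_fixtures(repo_root, suite="v1"):
    """Map each category folder name to the sorted fixture file names in it.

    Assumption: one sub-folder per category, fixtures stored as *.json files.
    Returns an empty dict when the suite root does not exist.
    """
    root = Path(repo_root) / SUITE_ROOTS[suite]
    catalog = {}
    if not root.is_dir():
        return catalog
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        names = sorted(f.name for f in category_dir.glob("*.json"))
        if names:
            catalog[category_dir.name] = names
    return catalog


if __name__ == "__main__":
    # Run from a checkout of the repository to summarize the v2 suite.
    for category, names in list_fixtures(".", "v2").items():
        print(f"{category}: {len(names)} fixture(s)")
```

Keeping the suite version as a lookup key mirrors the Suite version filter in the web catalog: the same walk works for either tree.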

The web test catalog mirrors these trees and lets you switch suite versions from the Suite version filter.

The optional roblox category vendors Roblox OpenGameEval fixtures into the same catalog, but executes them through Roblox’s hosted OpenCloud evaluation API.

Categories

Fixture folders (categories) in the suite include:

Category	Focus
speed	Latency-sensitive tasks where concise, correct output matters
security	Safer code changes, vulnerability awareness, refusal behavior
reasoning	Arithmetic, structured steps, logical problems
debugging	Minimal patches and edge-heavy bug fixes
refactoring	Rewrites that must preserve behavior
hallucination	Grounded answers when context is missing or adversarial
coding	Executable JavaScript function tasks scored by hidden test cases
ui	Single-file HTML/UI artifacts with Playwright render and judge validation
coding_ui	HTML (and similar) artifacts; optional Playwright render stage
roblox	Roblox OpenGameEval Lua/game tasks; special category, excluded from Overall

roblox is opt-in: default all-category runs skip it, while explicit Roblox runs still appear in reports and web breakdowns. It is excluded from the Overall score, rank, trends, and best-run logic because Roblox's hosted eval API currently supports only openai, claude, and gemini.
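
The opt-in and exclusion rules can be sketched as follows. This is an illustration only, not BLXBench's implementation: the function names are hypothetical, and the unweighted mean used for Overall is an assumption, since this page does not state how Overall is aggregated.

```python
# Category names taken from the table above.
ALL_CATEGORIES = [
    "speed", "security", "reasoning", "debugging", "refactoring",
    "hallucination", "coding", "ui", "coding_ui", "roblox",
]
OPT_IN = {"roblox"}                 # skipped unless explicitly requested
EXCLUDED_FROM_OVERALL = {"roblox"}  # never counted toward Overall


def select_categories(requested=None):
    """Return the categories to run.

    With no explicit request, every category except the opt-in ones is run.
    """
    if requested is None:
        return [c for c in ALL_CATEGORIES if c not in OPT_IN]
    unknown = [c for c in requested if c not in ALL_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return list(requested)


def overall_score(category_scores):
    """Average the scores of categories that count toward Overall.

    Assumption: an unweighted mean; the real aggregation may differ.
    Returns None when no counted category has a score.
    """
    counted = [s for c, s in category_scores.items()
               if c not in EXCLUDED_FROM_OVERALL]
    if not counted:
        return None
    return sum(counted) / len(counted)


if __name__ == "__main__":
    print(select_categories())            # default run, no roblox
    print(select_categories(["roblox"]))  # explicit Roblox run
    print(overall_score({"speed": 0.5, "coding": 1.0, "roblox": 0.0}))
```

Whatever the roblox score is in the last call, the result depends only on the other two categories.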

Running roblox does not require local Python, uv, or Roblox Studio. It does require a Roblox account and an OpenCloud API key with the studio-evaluations:create permission. Provide that key as OPEN_GAME_EVAL_API_KEY through .env, a shell export, or the env entry in ~/.blxbench/config.json.
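
A minimal sketch of looking the key up from these sources is shown below. It is not the BLXBench loader: `resolve_api_key` is a hypothetical helper, the precedence (process environment first, then the config file) is an assumption, and so is the idea that `env` is a top-level JSON object in `config.json`. Values from a `.env` file are assumed to have already been loaded into the process environment.

```python
import json
import os
from pathlib import Path

KEY_NAME = "OPEN_GAME_EVAL_API_KEY"


def resolve_api_key(environ=None, config_path=None):
    """Return the OpenCloud key, or None when no source provides it.

    Assumed lookup order: process environment, then the "env" object
    inside ~/.blxbench/config.json.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(KEY_NAME)
    if value:
        return value

    if config_path is None:
        path = Path.home() / ".blxbench" / "config.json"
    else:
        path = Path(config_path)
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or unreadable config: treat as "no key configured".
        return None

    env_section = config.get("env") if isinstance(config, dict) else None
    if isinstance(env_section, dict):
        return env_section.get(KEY_NAME) or None
    return None
```

Checking for the key before starting a roblox run makes it possible to fail early with a clear message rather than partway through a hosted evaluation.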

Difficulty Levels

Each fixture has a difficulty level:

Level	Description
Easy	Lighter prompts or more lenient scoring
Medium	Representative difficulty
Hard	Stricter scorers or longer tasks

Viewing Tests

Browse cases in the test catalog. Detail pages show id, prompts (where exposed), category, level, scorer, and historical pass rates when data exists.

Test Format

Fixtures are JSON with fields such as:

id / file name — Stable identifier
prompt — User input
category — Folder / domain
level — easy | medium | hard (legacy aliases normalized at runtime)
Scorer configuration — How passes are judged (exact match, rubric, hidden executable tests for coding, render+judge for ui / coding_ui, etc.)

Exact schema varies by benchmark type; open a fixture in the repo for the authoritative shape.
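
To make the field list concrete, here is a hedged sketch of loading one fixture and normalizing its level. The example fixture, the `scorer` key and its contents, the `REQUIRED_FIELDS` tuple, and the alias table are all hypothetical; the real schema and the real legacy aliases are defined by the fixtures and runtime in the repo.

```python
import json

VALID_LEVELS = {"easy", "medium", "hard"}

# Hypothetical alias table: the actual legacy aliases are not listed here.
LEGACY_LEVEL_ALIASES = {"beginner": "easy", "advanced": "hard"}

# Assumed minimum set of keys, based on the field list above.
REQUIRED_FIELDS = ("id", "prompt", "category", "level")

# Made-up fixture for illustration; not copied from the repository.
EXAMPLE_FIXTURE = """
{
  "id": "reasoning-example-001",
  "prompt": "What is 17 + 25?",
  "category": "reasoning",
  "level": "Easy",
  "scorer": {"type": "exact_match", "expected": "42"}
}
"""


def normalize_level(raw):
    """Lower-case the level, map assumed aliases, and validate the result."""
    level = str(raw).strip().lower()
    level = LEGACY_LEVEL_ALIASES.get(level, level)
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown level: {raw!r}")
    return level


def load_fixture(text):
    """Parse fixture JSON, check the assumed fields, normalize the level."""
    data = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    data["level"] = normalize_level(data["level"])
    return data


if __name__ == "__main__":
    fixture = load_fixture(EXAMPLE_FIXTURE)
    print(fixture["id"], fixture["category"], fixture["level"])
```

Normalizing the level at load time means downstream code only ever sees easy, medium, or hard, regardless of how an older fixture spelled it.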

Contributing Tests

A public “submit a fixture” workflow is not available yet. Today, changes go through the main repository: add JSON under packages/benchmark-core/tests/<category>/ and open a pull request. Tests should be deterministic, safe to run unattended, and cheap enough for community runners where possible.
