Our Tests
Explore the BLXBench test catalog and understand test categories.
Test Catalog
BLXBench evaluates models against a fixed, versioned set of JSON fixtures. In the repo, fixtures live under:
- v1: packages/benchmark-core/tests/
- v2: packages/benchmark-core/suites/v2/tests/
The web test catalog mirrors these trees and lets you switch suite versions from the Suite version filter.
The optional roblox category vendors Roblox OpenGameEval fixtures into the same catalog, but executes them through Roblox’s hosted OpenCloud evaluation API.
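As a rough illustration of the layout described above, the sketch below walks one suite tree and groups fixture files by category folder. The function name `list_fixtures` and the assumption that every category is a direct subfolder containing `*.json` files are ours, not something the repo guarantees.

```python
from pathlib import Path

# Fixture roots per suite version, relative to the repo root (from the list above).
SUITE_ROOTS = {
    "v1": Path("packages/benchmark-core/tests"),
    "v2": Path("packages/benchmark-core/suites/v2/tests"),
}


def list_fixtures(repo_root, suite_version):
    """Return {category: [fixture file names]} for one suite version.

    Assumption: each category is a direct subfolder holding *.json fixtures.
    """
    root = Path(repo_root) / SUITE_ROOTS[suite_version]
    catalog = {}
    if not root.is_dir():
        return catalog  # suite tree not present in this checkout
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        names = sorted(f.name for f in category_dir.glob("*.json"))
        if names:
            catalog[category_dir.name] = names
    return catalog


if __name__ == "__main__":
    for category, names in list_fixtures(".", "v1").items():
        print(f"{category}: {len(names)} fixture(s)")
```

Run from a repo checkout, this prints one line per category folder that contains at least one JSON fixture.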
Categories
Fixture folders (categories) in the suite include:
| Category | Focus |
|---|---|
| speed | Latency-sensitive tasks where concise, correct output matters |
| security | Safer code changes, vulnerability awareness, refusal behavior |
| reasoning | Arithmetic, structured steps, logical problems |
| debugging | Minimal patches and edge-heavy bug fixes |
| refactoring | Rewrites that must preserve behavior |
| hallucination | Grounded answers when context is missing or adversarial |
| coding | Executable JavaScript function tasks scored by hidden test cases |
| ui | Single-file HTML/UI artifacts with Playwright render and judge validation |
| coding_ui | HTML (and similar) artifacts; optional Playwright render stage |
| roblox | Roblox OpenGameEval Lua/game tasks; special category, excluded from Overall |
roblox is opt-in: default all-category runs skip it, while explicit Roblox runs show up in reports and web breakdowns. Roblox results do not affect the Overall score, rank, trends, or best-run logic, because Roblox’s hosted eval API currently supports only openai, claude, and gemini.
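The opt-in and exclusion rules can be sketched as follows. This is not BLXBench's actual implementation: `select_categories` and `overall_score` are hypothetical names, and treating Overall as an unweighted mean of per-category scores is an assumption made only to show the exclusion.

```python
# Categories from the table above; roblox is the only opt-in one.
ALL_CATEGORIES = [
    "speed", "security", "reasoning", "debugging", "refactoring",
    "hallucination", "coding", "ui", "coding_ui", "roblox",
]
OPT_IN = {"roblox"}                 # skipped unless requested explicitly
EXCLUDED_FROM_OVERALL = {"roblox"}  # reported, but never counted in Overall


def select_categories(requested=None):
    """Default runs skip opt-in categories; explicit requests are honored."""
    if requested is None:
        return [c for c in ALL_CATEGORIES if c not in OPT_IN]
    unknown = [c for c in requested if c not in ALL_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return list(requested)


def overall_score(category_scores):
    """Mean of the counted categories (the mean itself is an assumption)."""
    counted = [
        score for category, score in category_scores.items()
        if category not in EXCLUDED_FROM_OVERALL
    ]
    return sum(counted) / len(counted) if counted else None
```

With these definitions, a run that includes roblox still reports its score, but `overall_score` returns the same value whether or not the roblox entry is present.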
Running roblox does not require local Python, uv, or Roblox Studio. It requires a Roblox account and an OpenCloud API key with studio-evaluations:create; configure that key as OPEN_GAME_EVAL_API_KEY through .env, shell exports, or ~/.blxbench/config.json → env.
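A minimal sketch of how the key lookup might work, assuming shell exports take precedence over the env block in ~/.blxbench/config.json. The precedence order and the `resolve_api_key` helper are assumptions for illustration; .env loading is left out, since it usually just populates the process environment before this point.

```python
import json
import os
from pathlib import Path

KEY_NAME = "OPEN_GAME_EVAL_API_KEY"


def resolve_api_key(environ=None, config_path=None):
    """Return the key from the environment, else from config.json's "env" block.

    Assumed precedence: process environment first, then the config file.
    Returns None when the key is not configured anywhere.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(KEY_NAME)
    if value:
        return value

    path = Path(config_path) if config_path else Path.home() / ".blxbench" / "config.json"
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None  # missing or unreadable config counts as "not configured"

    env_section = config.get("env") if isinstance(config, dict) else None
    if isinstance(env_section, dict):
        return env_section.get(KEY_NAME) or None
    return None
```

A config file matching this sketch would contain an object such as `{"env": {"OPEN_GAME_EVAL_API_KEY": "..."}}`.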
Difficulty Levels
Each fixture has a difficulty level:
| Level | Description |
|---|---|
| Easy | Lighter prompts / scoring |
| Medium | Representative difficulty |
| Hard | Stricter scorers or longer tasks |
Viewing Tests
Browse cases in the test catalog. Detail pages show id, prompts (where exposed), category, level, scorer, and historical pass rates when data exists.
Test Format
Fixtures are JSON with fields such as:
- id / file name: Stable identifier
- prompt: User input
- category: Folder / domain
- level: easy | medium | hard (legacy aliases normalized at runtime)
- Scorer configuration: How passes are judged (exact match, rubric, hidden executable tests for coding, render+judge for ui/coding_ui, etc.)
Exact schema varies by benchmark type; open a fixture in the repo for the authoritative shape.
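To make the field list concrete, here is a hedged sketch of a fixture and a small pre-flight check. The `scorer` key, its contents, and the alias names in `LEVEL_ALIASES` are hypothetical placeholders; the real schema and the real legacy aliases live in the repo fixtures and runtime.

```python
import json

LEVELS = {"easy", "medium", "hard"}
# Hypothetical legacy aliases, for illustration only.
LEVEL_ALIASES = {"simple": "easy", "normal": "medium", "difficult": "hard"}
REQUIRED_FIELDS = ("id", "prompt", "category", "level")


def normalize_level(raw):
    """Lower-case the level and map any (assumed) legacy alias to its canonical name."""
    level = str(raw).strip().lower()
    level = LEVEL_ALIASES.get(level, level)
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {raw!r}")
    return level


def check_fixture(fixture):
    """Return a copy with a normalized level, or raise if a core field is missing."""
    missing = [key for key in REQUIRED_FIELDS if key not in fixture]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {**fixture, "level": normalize_level(fixture["level"])}


# Illustrative fixture; the scorer shape is an assumption, not the real schema.
EXAMPLE = json.loads("""
{
  "id": "reasoning-add-001",
  "prompt": "What is 2 + 2? Answer with the number only.",
  "category": "reasoning",
  "level": "Easy",
  "scorer": {"type": "exact_match", "expected": "4"}
}
""")

if __name__ == "__main__":
    print(check_fixture(EXAMPLE)["level"])
```

A check like this is only a convenience before opening a pull request; the fixtures in the repo remain the authoritative reference for field names.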
Contributing Tests
A public “submit a fixture” workflow is not available yet. Today, changes go through the main repository: add JSON under packages/benchmark-core/tests/<category>/ and open a pull request. Tests should be deterministic, safe to run unattended, and cheap enough for community runners where possible.