# Our Tests
Explore the BLXBench test catalog and understand test categories.
## Test Catalog
BLXBench evaluates models against a fixed, versioned set of JSON fixtures under `packages/benchmark-core/tests/`. Each file defines prompts, scoring, and metadata; the web test catalog mirrors that tree.
## Categories
Fixture folders (categories) in the suite include:
| Category | Focus |
|---|---|
| `speed` | Latency-sensitive tasks where concise, correct output matters |
| `security` | Safer code changes, vulnerability awareness, refusal behavior |
| `reasoning` | Arithmetic, structured steps, logical problems |
| `debugging` | Minimal patches and edge-heavy bug fixes |
| `refactoring` | Rewrites that must preserve behavior |
| `hallucination` | Grounded answers when context is missing or adversarial |
| `coding_ui` | HTML (and similar) artifacts; optional Playwright render stage |
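Putting the path and the category folders together, the fixture tree can be sketched as follows (illustrative; check the repository for the current set of folders):

```
packages/benchmark-core/tests/
├── speed/
├── security/
├── reasoning/
├── debugging/
├── refactoring/
├── hallucination/
└── coding_ui/
```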
## Difficulty Levels
Each fixture has a difficulty level:
| Level | Description |
|---|---|
| Easy | Shorter prompts and/or more lenient scoring |
| Medium | Representative difficulty |
| Hard | Stricter scorers or longer tasks |
## Viewing Tests
Browse cases in the test catalog. Detail pages show id, prompts (where exposed), category, level, scorer, and historical pass rates when data exists.
## Test Format
Fixtures are JSON with fields such as:
- `id` / file name — Stable identifier
- `prompt` — User input
- `category` — Folder / domain
- `level` — `easy` | `medium` | `hard` (legacy aliases are normalized at runtime)
- Scorer configuration — How passes are judged (exact match, rubric, render+judge for `coding_ui`, etc.)
Exact schema varies by benchmark type; open a fixture in the repo for the authoritative shape.
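For illustration only, a minimal fixture using the generic fields above might look like this; the `scorer` object shown here is a hypothetical example, not the real schema:

```json
{
  "id": "reasoning_modular_arithmetic_01",
  "prompt": "What is 7^4 mod 10? Answer with a single digit.",
  "category": "reasoning",
  "level": "medium",
  "scorer": { "type": "exact_match", "expected": "1" }
}
```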
## Contributing Tests
A public “submit a fixture” workflow is not available yet. Today, changes go through the main repository: add JSON under `packages/benchmark-core/tests/<category>/` and open a pull request. Tests should be deterministic, safe to run unattended, and cheap enough for community runners where possible.
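Before opening a pull request, it can help to sanity-check new fixtures locally. BLXBench does not ship such a script; the sketch below is a hypothetical pre-PR check that validates only the generic fields described above (the real schema varies by benchmark type):

```python
import json
from pathlib import Path

# Hypothetical checker, assuming the generic fields listed in "Test Format".
REQUIRED_FIELDS = {"id", "prompt", "category", "level"}
LEVELS = {"easy", "medium", "hard"}

def check_fixture(path: Path) -> list[str]:
    """Return a list of problems found in one fixture file (empty = OK)."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(data, dict):
        return [f"{path}: top-level value must be a JSON object"]
    problems = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"{path}: missing fields {sorted(missing)}")
    if data.get("level") not in LEVELS:
        problems.append(f"{path}: unexpected level {data.get('level')!r}")
    return problems

def check_tree(root: Path) -> list[str]:
    """Check every *.json fixture under a tests directory."""
    return [p for f in sorted(root.rglob("*.json")) for p in check_fixture(f)]
```

For example, `check_tree(Path("packages/benchmark-core/tests"))` would return an empty list when every fixture carries the generic fields.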