Fixture reference
Our tests
456 fixtures in suite v2 — Resilience
BLXBench runs a fixed, versioned set of JSON fixtures. Each fixture has a category (domain), a difficulty level, a prompt, and an automatic scorer. The optional Roblox OpenGameEval category appears as a special category but is excluded from the Overall ranking.
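As a rough illustration of the fixture shape and the ranking exclusion described above, here is a minimal Python sketch. The field names, the `opengameeval` slug, and the use of a plain mean for the Overall score are all assumptions made for this example; the actual schema and ranking formula are not specified on this page.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict

# Hypothetical slug for the optional Roblox OpenGameEval category.
SPECIAL_CATEGORIES = {"opengameeval"}


@dataclass(frozen=True)
class Fixture:
    """Illustrative fixture record; field names are assumed, not the real schema."""
    id: str
    category: str  # domain, e.g. "coding"
    level: str     # "easy", "medium", or "hard"
    prompt: str
    scorer: str    # assumed: a key naming the automatic scorer


def overall_score(category_scores: Dict[str, float]) -> float:
    """Average per-category scores, skipping special categories.

    A plain mean is an assumption; the real Overall formula may differ.
    """
    ranked = [s for c, s in category_scores.items() if c not in SPECIAL_CATEGORIES]
    if not ranked:
        raise ValueError("no ranked categories to average")
    return mean(ranked)
```

In this sketch a score for the special category can be present in the input without affecting the result, which mirrors "visible, but excluded from Overall ranking".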
A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.
Every fixture declares a level. Use easy, medium, or hard in JSON. A legacy alias for easy is also accepted and is treated the same as easy everywhere (blxbench filters, the leaderboard, and this site).
easy
Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.
medium
Default difficulty: representative prompt length and evaluation strictness for the category.
hard
Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.
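The level rules above can be sketched as a small validator. This is a hedged example, not BLXBench's own code: `normalize_level` is a hypothetical helper, and the legacy alias is passed in as a parameter because its exact spelling is not given here (the `"beginner"` value in the usage line is a placeholder only).

```python
from typing import Mapping, Optional

CANONICAL_LEVELS = ("easy", "medium", "hard")


def normalize_level(raw: str, aliases: Optional[Mapping[str, str]] = None) -> str:
    """Map a fixture's declared level to one of the canonical levels.

    `aliases` maps legacy spellings to canonical ones; its contents are
    supplied by the caller because the real legacy name is not documented here.
    """
    value = (aliases or {}).get(raw, raw)
    if value not in CANONICAL_LEVELS:
        raise ValueError(f"unknown level: {raw!r}")
    return value


# Usage with a placeholder alias (hypothetical, for illustration only):
level = normalize_level("beginner", {"beginner": "easy"})
```

Normalizing once at load time means filters, the leaderboard, and the site can all compare against the same three values.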
Suite version
456 fixtures · 29 models tested
Fixture domains, each with its own focus and scorers. Counts are from the current tree under packages/benchmark-core/suites/v2/tests.
coding
Coding
Implementation-focused coding tasks with structured correctness checks.
60 fixtures
cost
Cost
Cost-aware correctness and efficient API spend per successful task.
30 fixtures
debugging
Debugging
Bug fixes, edge conditions, and minimal patch accuracy.
60 fixtures
hallucination
Hallucination
Grounded answers under adversarial or missing-context prompts.
60 fixtures
reasoning
Reasoning
Arithmetic, symbolic steps, and structured problem solving.
60 fixtures
refactoring
Refactoring
Code transformation while preserving behavior and intent.
60 fixtures
security
Security
Secure code changes, vulnerability recognition, and safe defaults.
60 fixtures
speed
Speed
Throughput and TTFT-focused generation tasks.
60 fixtures
ui
UI
Single-file HTML visual/UI artifacts with render and preview workflows.
6 fixtures
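The per-category counts listed above add up to the suite total of 456. A short sketch of how such counts could be tallied from a fixture tree follows; it assumes one JSON file per fixture with a top-level `category` key, which is an assumption about the layout rather than a documented fact.

```python
import json
from collections import Counter
from pathlib import Path

# Counts as listed on this page.
DOCUMENTED_COUNTS = {
    "coding": 60, "cost": 30, "debugging": 60, "hallucination": 60,
    "reasoning": 60, "refactoring": 60, "security": 60, "speed": 60, "ui": 6,
}
assert sum(DOCUMENTED_COUNTS.values()) == 456


def count_by_category(tests_root: Path) -> Counter:
    """Count fixtures per category under tests_root.

    Assumes one JSON file per fixture with a "category" key (hypothetical layout).
    """
    counts = Counter()
    for path in sorted(tests_root.rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            fixture = json.load(fh)
        counts[fixture["category"]] += 1
    return counts
```

Run against a checkout, this would be called as `count_by_category(Path("packages/benchmark-core/suites/v2/tests"))` and compared with `DOCUMENTED_COUNTS`.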
Matrix
One example fixture per category and level (where defined).
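One way to build such a matrix is to keep the first fixture seen for each (category, level) pair and leave undefined pairs absent. The sketch below assumes fixtures are plain dicts with `category` and `level` keys; `build_matrix` is a hypothetical helper, and which fixture the site actually picks as the example is not stated here.

```python
from typing import Dict, Iterable, Mapping, Tuple


def build_matrix(fixtures: Iterable[Mapping[str, str]]) -> Dict[Tuple[str, str], Mapping[str, str]]:
    """Pick one example fixture per (category, level) cell.

    Cells with no fixture are simply missing from the result ("where defined").
    """
    matrix: Dict[Tuple[str, str], Mapping[str, str]] = {}
    for fixture in fixtures:
        key = (fixture["category"], fixture["level"])
        # setdefault keeps the first fixture encountered for each cell.
        matrix.setdefault(key, fixture)
    return matrix
```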