Fixture reference

Our tests

456 fixtures in suite v2 — Resilience

BLXBench runs a fixed, versioned set of JSON fixtures. Each fixture has a category (domain), a difficulty level, a prompt, and an automatic scorer. The optional Roblox OpenGameEval category appears as a special category but is excluded from the Overall ranking.
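
As a rough illustration of the four parts listed above, the sketch below models a fixture as a small Python record and validates it when loading from JSON. The field names (`id`, `category`, `level`, `prompt`, `scorer`), the `Fixture` class, and the `parse_fixture` helper are assumptions made for illustration; the actual BLXBench fixture schema is not shown on this page and may differ.

```python
import json
from dataclasses import dataclass

# Level names taken from the "Levels" section of this page.
LEVELS = ("easy", "medium", "hard")

# Hypothetical field names; the real fixture schema may use different keys.
REQUIRED_FIELDS = ("id", "category", "level", "prompt", "scorer")


@dataclass(frozen=True)
class Fixture:
    id: str
    category: str
    level: str
    prompt: str
    scorer: str  # hypothetical: identifier of the automatic scorer


def parse_fixture(raw: str) -> Fixture:
    """Parse one JSON fixture and check the fields this sketch assumes."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["level"] not in LEVELS:
        raise ValueError(f"unknown level: {data['level']!r}")
    return Fixture(**{key: data[key] for key in REQUIRED_FIELDS})


# The id comes from the matrix on this page; prompt and scorer are placeholders.
example = parse_fixture(json.dumps({
    "id": "Coding-Easy-Capitalize-Words",
    "category": "coding",
    "level": "easy",
    "prompt": "(placeholder prompt text)",
    "scorer": "(placeholder scorer name)",
}))
```

Rejecting unknown levels at load time is a design choice of this sketch, not a documented BLXBench behavior.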

A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.

Levels

Every fixture declares a level. Use easy, medium, or hard in JSON; legacy easy is accepted and treated the same as easy everywhere (blxbench filters, leaderboard, this site).

easy

Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.

medium

Default difficulty: representative prompt length and evaluation strictness for the category.

hard

Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.
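
One way to treat a legacy level spelling "the same as easy everywhere" is to normalize level strings once, before filtering or ranking. The following is a minimal sketch under stated assumptions: `normalize_level` and the alias table are hypothetical, and because this page does not spell out the legacy name, the alias key below is a placeholder rather than a real value.

```python
from typing import Dict, Optional

CANONICAL_LEVELS = ("easy", "medium", "hard")

# Placeholder only: the real legacy spelling is not given on this page.
HYPOTHETICAL_ALIASES: Dict[str, str] = {"legacy-easy-name": "easy"}


def normalize_level(value: str, aliases: Optional[Dict[str, str]] = None) -> str:
    """Map an accepted alias to its canonical level, then validate it."""
    level = (aliases or {}).get(value, value)
    if level not in CANONICAL_LEVELS:
        raise ValueError(f"unknown level: {value!r}")
    return level
```

With the placeholder table, `normalize_level("legacy-easy-name", HYPOTHETICAL_ALIASES)` returns `"easy"`, so downstream filters only ever see the three canonical names.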

Suite version

456 fixtures · 29 models tested

Coding

Implementation-focused coding tasks with structured correctness checks.

60 fixtures

Cost

Cost-aware correctness and efficient API spend per successful task.

30 fixtures

Debugging

Bug fixes, edge conditions, and minimal patch accuracy.

60 fixtures

Hallucination

Grounded answers under adversarial or missing-context prompts.

60 fixtures

Reasoning

Arithmetic, symbolic steps, and structured problem solving.

60 fixtures

Refactoring

Code transformation while preserving behavior and intent.

60 fixtures

Security

Secure code changes, vulnerability recognition, and safe defaults.

60 fixtures

Speed

Generation tasks focused on throughput and time to first token (TTFT).

60 fixtures

UI

Single-file HTML visual/UI artifacts with render and preview workflows.

6 fixtures
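
The per-category counts above can be cross-checked against the suite total of 456 with a few lines of Python. This is only an arithmetic check on the numbers printed on this page; the dictionary and variable names are illustrative.

```python
# Fixture counts per category, copied from this page.
FIXTURE_COUNTS = {
    "Coding": 60,
    "Cost": 30,
    "Debugging": 60,
    "Hallucination": 60,
    "Reasoning": 60,
    "Refactoring": 60,
    "Security": 60,
    "Speed": 60,
    "UI": 6,
}

SUITE_TOTAL = 456  # total stated for suite v2

total = sum(FIXTURE_COUNTS.values())
# Seven categories of 60, plus 30 for Cost and 6 for UI.
assert total == SUITE_TOTAL
```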

Matrix

One example fixture per category and level (where defined).

Category	easy	medium	hard
Coding	Coding-Easy-Capitalize-Words	Coding-Medium-Circular-Buffer	Coding-Hard-Astar-Grid
Cost	Cost-Generation-Fibonacci	Cost-Analysis-Closure-Counter	Cost-Analysis-Buggy-Memoize
Debugging	Debug-Array-Sort-Mutation-V2	Debug-Async-ForEach-V2	Debug-Cache-Invalidation-Race-V2
Hallucination	Halluc-Api-Array-Flat	Halluc-Api-Generator-Return	Halluc-Api-Atomics-Wait
Reasoning	Reason-Ce-Even-Number	Reason-Constraint-Batch-Window	Reason-Constraint-Consistency-Latency
Refactoring	Refactor-Array-Push-Loop-Spread	Refactor-Array-Manipulation-Pipeline	Refactor-Auth-Policy-Boundaries
Security	Sec-Cookie-Policy-Validator	Sec-Abac-Rule-Engine	Sec-Abuse-Detection-Rate-Window
Speed	Speed-Cli-Flags	Speed-Alert-Normalization	Speed-Architecture-Brief
UI	Ui-Easy-Login-Card	Ui-Medium-Admin-User-Table	Ui-Hard-Game-Lobby
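
A matrix like the one above could be derived from a flat list of fixtures by keeping one id per category and level. The sketch below is an assumption about how such a table might be built, not a description of how this site generates it: `build_matrix` is a hypothetical helper, picking the alphabetically first id is an arbitrary choice, and the sample omits the Coding medium fixture on purpose to show an undefined cell.

```python
from collections import defaultdict

LEVELS = ("easy", "medium", "hard")


def build_matrix(fixtures):
    """Return {category: [easy_id, medium_id, hard_id]}; None where undefined.

    `fixtures` is an iterable of (fixture_id, category, level) tuples.
    """
    picked = defaultdict(dict)
    # Sorting by id makes the pick deterministic (alphabetically first id wins).
    for fixture_id, category, level in sorted(fixtures):
        picked[category].setdefault(level, fixture_id)
    return {
        category: [row.get(level) for level in LEVELS]
        for category, row in picked.items()
    }


# Ids taken from the matrix on this page; Coding medium is left out deliberately.
sample = [
    ("Ui-Hard-Game-Lobby", "UI", "hard"),
    ("Ui-Easy-Login-Card", "UI", "easy"),
    ("Ui-Medium-Admin-User-Table", "UI", "medium"),
    ("Coding-Hard-Astar-Grid", "Coding", "hard"),
    ("Coding-Easy-Capitalize-Words", "Coding", "easy"),
]

matrix = build_matrix(sample)
```

Here `matrix["Coding"]` has `None` in the medium position, which corresponds to the "where defined" caveat in the matrix description.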