Fixture reference

Our tests

456 fixtures in suite v2 — Resilience

BLXBench runs a fixed, versioned set of JSON fixtures. Each test has a category (domain), a difficulty level, a prompt, and an automatic scorer. The optional Roblox OpenGameEval category appears as a special category but is excluded from the Overall ranking.
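As a rough illustration of the four parts listed above, a fixture could be modeled like this. This is a minimal sketch, not the actual BLXBench schema: the JSON field names, the scorer key "exact-match", the sample prompt, and the category id "roblox-opengameeval" are all assumptions made for the example.

```python
import json
from dataclasses import dataclass

# Hypothetical id for the optional Roblox OpenGameEval category.
# The real id is not given on this page.
EXCLUDED_FROM_OVERALL = {"roblox-opengameeval"}

REQUIRED_FIELDS = ("id", "category", "level", "prompt", "scorer")


@dataclass(frozen=True)
class Fixture:
    id: str
    category: str  # domain, e.g. "coding"
    level: str     # "easy", "medium", or "hard"
    prompt: str
    scorer: str    # assumed to be a string key naming the automatic scorer


def load_fixture(text: str) -> Fixture:
    """Parse one fixture from JSON text, failing loudly on missing fields."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"fixture is missing fields: {missing}")
    return Fixture(**{key: data[key] for key in REQUIRED_FIELDS})


def counts_toward_overall(fixture: Fixture) -> bool:
    """Special categories stay visible but do not feed the Overall ranking."""
    return fixture.category not in EXCLUDED_FROM_OVERALL


# Illustrative fixture; only the id is taken from the matrix on this page.
EXAMPLE_JSON = json.dumps({
    "id": "Coding-Easy-Capitalize-Words",
    "category": "coding",
    "level": "easy",
    "prompt": "Capitalize the first letter of every word.",  # made-up prompt
    "scorer": "exact-match",                                  # made-up scorer key
})

fixture = load_fixture(EXAMPLE_JSON)
print(fixture.id, counts_toward_overall(fixture))
```

Keeping the exclusion as a set of category ids means a special category can still be loaded, filtered, and displayed like any other; only the ranking step needs to consult the set.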

A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.

Levels

Every fixture declares a level. Use easy, medium, or hard in JSON; a legacy alias for easy is also accepted and treated the same as easy everywhere (blxbench filters, leaderboard, this site).

easy

Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.

medium

Default difficulty: representative prompt length and evaluation strictness for the category.

hard

Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.
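The level rule in this section (three canonical values, plus a legacy spelling that is treated as easy) can be sketched as a small normalizer. The legacy spelling itself is not recoverable from this page, so the sketch takes the alias map as a parameter instead of hard-coding a name; the function name and signature are assumptions, not part of blxbench.

```python
from typing import Mapping, Optional

CANONICAL_LEVELS = ("easy", "medium", "hard")


def normalize_level(raw: str, legacy_aliases: Optional[Mapping[str, str]] = None) -> str:
    """Map a declared level to its canonical form.

    legacy_aliases maps an old spelling to a canonical level. The caller
    supplies it, because the real legacy name is not known here.
    """
    aliases = legacy_aliases or {}
    level = aliases.get(raw, raw)
    if level not in CANONICAL_LEVELS:
        raise ValueError(f"unknown level {raw!r}; expected one of {CANONICAL_LEVELS}")
    return level


# "old-easy" is a placeholder alias used only for this demonstration.
print(normalize_level("medium"))
print(normalize_level("old-easy", {"old-easy": "easy"}))
```

Normalizing once at load time, rather than in each consumer, is one way to keep filters, the leaderboard, and the site in agreement about which fixtures count as easy.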

Suite version

456 fixtures · 29 models tested

Categories

Fixture domains, each with its own focus and scorers. Counts are from the current tree under packages/benchmark-core/suites/v2/tests.

coding

Coding

Implementation-focused coding tasks with structured correctness checks.

60 fixtures

cost

Cost

Cost-aware correctness and efficient API spend per successful task.

30 fixtures

debugging

Debugging

Bug fixes, edge conditions, and minimal patch accuracy.

60 fixtures

hallucination

Hallucination

Grounded answers under adversarial or missing-context prompts.

60 fixtures

reasoning

Reasoning

Arithmetic, symbolic steps, and structured problem solving.

60 fixtures

refactoring

Refactoring

Code transformation while preserving behavior and intent.

60 fixtures

security

Security

Secure code changes, vulnerability recognition, and safe defaults.

60 fixtures

speed

Speed

Throughput and TTFT-focused generation tasks.

60 fixtures

ui

UI

Single-file HTML visual/UI artifacts with render and preview workflows.

6 fixtures
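The per-category counts above can be checked against a local checkout with a short script. This is a sketch under two assumptions that the page does not confirm: that every fixture is a single .json file somewhere under the tests directory, and that each file carries a top-level "category" field. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Mapping, Tuple

# Counts as published on this page.
PUBLISHED_COUNTS = {
    "coding": 60, "cost": 30, "debugging": 60, "hallucination": 60,
    "reasoning": 60, "refactoring": 60, "security": 60, "speed": 60, "ui": 6,
}


def count_by_category(tests_root: Path) -> Counter:
    """Count fixture files per category by reading each file's "category" field."""
    counts: Counter = Counter()
    for path in sorted(tests_root.rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        counts[data["category"]] += 1
    return counts


def diff_counts(actual: Mapping[str, int],
                expected: Mapping[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {category: (actual, expected)} for every category that disagrees."""
    categories = sorted(set(actual) | set(expected))
    return {
        name: (actual.get(name, 0), expected.get(name, 0))
        for name in categories
        if actual.get(name, 0) != expected.get(name, 0)
    }


# The published per-category counts add up to the suite total of 456.
print(sum(PUBLISHED_COUNTS.values()))
```

With a checkout available, `diff_counts(count_by_category(Path("packages/benchmark-core/suites/v2/tests")), PUBLISHED_COUNTS)` would return an empty dict when the tree matches the numbers on this page.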

Matrix

One example fixture per category and level (where defined).

| Category | easy | medium | hard |
| --- | --- | --- | --- |
| Coding | Coding-Easy-Capitalize-Words | Coding-Medium-Circular-Buffer | Coding-Hard-Astar-Grid |
| Cost | Cost-Generation-Fibonacci | Cost-Analysis-Closure-Counter | Cost-Analysis-Buggy-Memoize |
| Debugging | Debug-Array-Sort-Mutation-V2 | Debug-Async-ForEach-V2 | Debug-Cache-Invalidation-Race-V2 |
| Hallucination | Halluc-Api-Array-Flat | Halluc-Api-Generator-Return | Halluc-Api-Atomics-Wait |
| Reasoning | Reason-Ce-Even-Number | Reason-Constraint-Batch-Window | Reason-Constraint-Consistency-Latency |
| Refactoring | Refactor-Array-Push-Loop-Spread | Refactor-Array-Manipulation-Pipeline | Refactor-Auth-Policy-Boundaries |
| Security | Sec-Cookie-Policy-Validator | Sec-Abac-Rule-Engine | Sec-Abuse-Detection-Rate-Window |
| Speed | Speed-Cli-Flags | Speed-Alert-Normalization | Speed-Architecture-Brief |
| UI | Ui-Easy-Login-Card | Ui-Medium-Admin-User-Table | Ui-Hard-Game-Lobby |
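A matrix like the one above, with one example per category and level and an empty cell where no fixture is defined, could be assembled as follows. How the site actually picks its example is not stated; this sketch assumes the alphabetically first fixture id in each cell, and the function name and tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

LEVELS = ("easy", "medium", "hard")

Row = Dict[str, Optional[str]]


def build_matrix(fixtures: Iterable[Tuple[str, str, str]]) -> Dict[str, Row]:
    """Build {category: {level: example_id or None}}.

    fixtures yields (fixture_id, category, level) tuples. The example for
    each cell is the alphabetically first id (an assumption of this sketch).
    """
    matrix: Dict[str, Row] = {}
    for fixture_id, category, level in fixtures:
        if level not in LEVELS:
            continue  # skip levels this sketch does not know about
        row = matrix.setdefault(category, {lvl: None for lvl in LEVELS})
        current = row[level]
        if current is None or fixture_id < current:
            row[level] = fixture_id
    return matrix


# Two ids taken from the UI row of the matrix; the medium cell is left
# out on purpose to show an undefined cell.
sample = [
    ("Ui-Hard-Game-Lobby", "ui", "hard"),
    ("Ui-Easy-Login-Card", "ui", "easy"),
]
for category, row in build_matrix(sample).items():
    print(category, [row[level] or "(none)" for level in LEVELS])
```

Cells default to None so that a renderer can distinguish "no fixture at this level" from a missing category.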

BLXBench

Community-driven leaderboard. Public benchmark runner: run in your environment, share results with the community.

© 2026 BLXBench by bitslix.com
