
Fixture reference

Our tests

372 fixtures in the public suite

BLXBench runs a fixed, versioned set of JSON fixtures. Each test has a category (domain), a difficulty level, a prompt, and an automatic scorer. Models are run against this suite; the leaderboard shows aggregate quality and cost from your local `results` data.
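The four parts of a fixture named above can be pictured as a small record. The following is a minimal sketch only: the JSON key names (`category`, `level`, `prompt`, `scorer`) and the scorer value `exact_match` are assumptions made for illustration and are not taken from the actual BLXBench schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixture:
    # Field names are hypothetical; the real fixture schema may differ.
    category: str
    level: str
    prompt: str
    scorer: str


REQUIRED_KEYS = ("category", "level", "prompt", "scorer")


def parse_fixture(raw: str) -> Fixture:
    """Parse one JSON fixture and check that all four parts are present."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"fixture is missing fields: {missing}")
    return Fixture(*(data[key] for key in REQUIRED_KEYS))


# "exact_match" is a placeholder scorer name, not a documented one.
example = parse_fixture(
    '{"category": "reasoning", "level": "medium", '
    '"prompt": "Compute the weighted average of the inputs.", '
    '"scorer": "exact_match"}'
)
print(example.category, example.level)
```

Rejecting a fixture with a missing part at load time keeps a malformed file from silently skewing aggregate scores.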

A path to submit or share new test fixtures is planned; it is not available yet, and we will document it when the workflow is ready.

Levels

Every fixture declares a level. Use easy, medium, or hard in JSON. A legacy alias for easy is also accepted and is treated the same as easy everywhere (blxbench filters, leaderboard, this site).

easy

Lighter tasks: typically shorter contexts or more constrained outputs. Same scoring pipeline, lower cognitive load for the model.

medium

Default difficulty: representative prompt length and evaluation strictness for the category.

hard

Demanding cases: stricter scorers, longer reasoning paths, or adversarial phrasing where applicable.
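Level handling as described in this section could be sketched as below. This is an illustration under assumptions, not the blxbench implementation: the function name `normalize_level`, the case-insensitive matching, and the placeholder alias `legacy_name` are all hypothetical. The actual legacy value is not reproduced here, so the alias table is passed in by the caller.

```python
from typing import Dict, Optional

# The three levels named in this section.
LEVELS = ("easy", "medium", "hard")


def normalize_level(value: str, aliases: Optional[Dict[str, str]] = None) -> str:
    """Return the canonical level, resolving any caller-supplied aliases.

    Trimming and lowercasing the input is an assumption of this sketch.
    """
    key = value.strip().lower()
    key = (aliases or {}).get(key, key)
    if key not in LEVELS:
        raise ValueError(f"unknown level: {value!r}")
    return key


# "legacy_name" is a stand-in for whatever legacy value the suite accepts.
print(normalize_level("hard"))
print(normalize_level("legacy_name", {"legacy_name": "easy"}))
```

Resolving aliases in one place means filters, leaderboard code, and the site can all compare against the same three canonical strings.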

Categories

Seven domains of fixtures, each with its own focus and scorers. Counts are from the current tree under packages/benchmark-core/tests.

coding_ui

Coding Ui

Benchmark tasks from the local fixture set.

6 fixtures

debugging

Debugging

Bug fixes, edge conditions, and minimal patch accuracy.

60 fixtures

hallucination

Hallucination

Grounded answers under adversarial or missing-context prompts.

60 fixtures

reasoning

Reasoning

Arithmetic, symbolic steps, and structured problem solving.

61 fixtures

refactoring

Refactoring

Code transformation while preserving behavior and intent.

60 fixtures

security

Security

Secure code changes, vulnerability recognition, and safe defaults.

60 fixtures

speed

Speed

Latency-sensitive tasks where concise correct output matters.

65 fixtures
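As a quick consistency check, the per-category counts listed above can be tallied against the suite total of 372. The snippet only restates numbers from this page; the variable names are illustrative.

```python
# Per-category fixture counts as listed on this page.
FIXTURE_COUNTS = {
    "coding_ui": 6,
    "debugging": 60,
    "hallucination": 60,
    "reasoning": 61,
    "refactoring": 60,
    "security": 60,
    "speed": 65,
}

category_count = len(FIXTURE_COUNTS)
total = sum(FIXTURE_COUNTS.values())
print(category_count, total)  # prints: 7 372
```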

Matrix

One example fixture per category and level (where defined).

| Category | easy | medium | hard |
| --- | --- | --- | --- |
| Coding Ui | Analog Clock | Thunderstorm Over City | Breakout Game |
| Debugging | Fix Greater Than | Fix Off By One Average | Bugfix |
| Hallucination | Not Stated Api Version | Not Stated Data Residency | Not Stated |
| Reasoning | Json Output Test | Weighted Average | Multi Step |
| Refactoring | Process Users Refactor | Extract And Guard | Cleanup |
| Security | Sql Concat | Ssrf Url Fetch | Review |
| Speed | Summary Cloud | Summary Incident Response | Summary |
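A matrix like this one can be derived from a flat list of fixtures by keeping one example per category and level. The sketch below is hypothetical: the `(category, level, name)` tuple shape and the "first fixture seen wins" rule are assumptions, since the page does not say how the examples are chosen.

```python
from typing import Dict, Iterable, Tuple

LEVELS = ("easy", "medium", "hard")


def build_matrix(
    fixtures: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, str]]:
    """Map category -> level -> name of the first fixture seen for that pair."""
    matrix: Dict[str, Dict[str, str]] = {}
    for category, level, name in fixtures:
        row = matrix.setdefault(category, {})
        row.setdefault(level, name)  # later fixtures for the same cell are ignored
    return matrix


sample = [
    ("security", "easy", "Sql Concat"),
    ("security", "medium", "Ssrf Url Fetch"),
    ("security", "hard", "Review"),
    ("reasoning", "medium", "Weighted Average"),
]
matrix = build_matrix(sample)
for category, row in matrix.items():
    # Cells with no fixture are shown as "-" ("where defined").
    print(category, [row.get(level, "-") for level in LEVELS])
```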

BLXBench

Community-driven leaderboard

Public benchmark runner: run in your environment, share results with the community.

© 2026 BLXBench by bitslix.com
