Headless Mode
Running benchmarks in automated environments.
Headless mode allows BLXBench to run in CI/CD pipelines, scripts, and automated workflows. Install the blxbench command via @bitslix/blxbench (see Installation) before running the examples below.
Basic Usage
blxbench --headless --provider opr --models openai/gpt-5.4-mini
Omit --headless if the process already has no TTY (typical in CI); blxbench then enters the same headless path automatically.
For --provider lms or oll, the runner uses fixed loopback URLs on 127.0.0.1. Pass --local-inference-port N when your LM Studio or Ollama daemon listens on a non-default TCP port (defaults remain 1234 and 11434).
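As a rough illustration of the port defaults above, the helper below picks the value a script might pass to --local-inference-port. The function name local_inference_port is hypothetical and not part of blxbench; only the provider codes and the two default ports come from this page.

```shell
# Hypothetical helper (not part of blxbench): resolve the loopback port for a
# local provider, falling back to the documented defaults.
local_inference_port() {
  provider="$1"
  override="${2:-}"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
    return 0
  fi
  case "$provider" in
    lms) printf '1234\n' ;;    # LM Studio default
    oll) printf '11434\n' ;;   # Ollama default
    *)   echo "unknown local provider: $provider" >&2; return 1 ;;
  esac
}

# Example: prints 11434
local_inference_port oll
```

A wrapper script could then call blxbench with --local-inference-port "$(local_inference_port oll 8080)" when the daemon runs on a custom port.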
Reports are written to the user's report directory by default:
- Linux/macOS: ~/.blxbench/reports/
- Windows: %USERPROFILE%\.blxbench\reports\
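For scripts on Linux or macOS that need to collect results afterwards, a minimal sketch follows. The helper names blx_reports_dir and blx_list_reports are hypothetical, and the sketch assumes only that each report.json sits somewhere beneath the reports directory; the exact run-folder layout is not described here.

```shell
# Hypothetical helper: print the default reports directory for the current user.
blx_reports_dir() {
  printf '%s\n' "${HOME}/.blxbench/reports"
}

# List every report.json beneath a directory (default: the reports directory).
# Assumption: report.json files live somewhere under that directory.
blx_list_reports() {
  dir="${1:-$(blx_reports_dir)}"
  [ -d "$dir" ] || return 0   # nothing to list if no run has happened yet
  find "$dir" -type f -name 'report.json' | sort
}
```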
Use --save-json PATH for an additional JSON copy, or use the TUI's /set output-dir PATH when running interactively.
Desktop notification when the run finishes
Add --notify to ask the OS for a short hint when the benchmark completes (the same rules as the TUI: success or failure, not an aborted run). You can also set BLXBENCH_NOTIFY=1 or persist desktopNotify in ~/.blxbench/config.json — see Configuration — Desktop notifications. BLXBENCH_NOTIFY=0 forces notifications off for CI.
Multiple models
Pass more than one model ID to run separate benchmark runs (one run_id and one report.json per model). Use --parallel [n] to cap concurrent sub-runs (default in the runner: min(3, number of models)). The global --ratelimit budget applies to all sub-runs together.
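The default concurrency cap described above, min(3, number of models), can be sketched as follows. The function name default_parallel is illustrative only; it mirrors the documented rule rather than the runner's actual code.

```shell
# Sketch of the documented default: min(3, number of models).
default_parallel() {
  models="$1"
  if [ "$models" -lt 3 ]; then
    printf '%s\n' "$models"
  else
    printf '3\n'
  fi
}

default_parallel 2   # prints 2
default_parallel 5   # prints 3
```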
With --submit, each report.json is uploaded separately. Quota is enforced per model for the current ISO calendar week (UTC): Scout includes 2, Bencher 5, and Founder 10 submissions per model per week. Each distinct model id in report.summary.models consumes one slot for that week, and a multi-model report counts once toward every model it includes. The CLI may still send a shared batch id (quotaGroupId) for correlation on the server; it does not merge quota.
Public submit only accepts full runs: no --limit, no category/level filters, and no fail-fast partial runs (exit_early must be false in report.json). The special roblox category may be attached to a full run; it is visible in the report but excluded from Overall. Details: Public submission rules.
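To make the quota arithmetic concrete, here is a small sketch using the plan limits quoted above. The function names, the plan spellings used as arguments, and the week-key format are assumptions for illustration; the server's own bookkeeping is not described on this page.

```shell
# Included submissions per model per ISO week, by plan (values from this page).
weekly_quota() {
  case "$1" in
    Scout)   printf '2\n' ;;
    Bencher) printf '5\n' ;;
    Founder) printf '10\n' ;;
    *) echo "unknown plan: $1" >&2; return 1 ;;
  esac
}

# Remaining slots for one model this week, never below zero.
remaining_slots() {
  quota="$(weekly_quota "$1")" || return 1
  used="$2"
  left=$((quota - used))
  if [ "$left" -lt 0 ]; then left=0; fi
  printf '%s\n' "$left"
}

# Key for the current ISO week in UTC (the YYYY-Www format is an assumption).
iso_week_key() {
  date -u +%G-W%V
}
```

Because each distinct model id consumes its own slot, a report covering three models would reduce remaining_slots by one for each of the three.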
Roblox OpenGameEval
roblox is an opt-in category backed by Roblox OpenGameEval rather than the normal chat-completions scorer. Default “all categories” runs exclude it; include roblox explicitly when you want those tests.
export OPEN_GAME_EVAL_API_KEY=...
export OPENAI_API_KEY=...
blxbench --headless --provider opr --models openai/gpt-5.4-mini \
--category coding_ui debugging hallucination reasoning refactoring security speed roblox \
--roblox-llm-name openai \
--roblox-llm-model-version gpt-5
No local Python, uv, or Roblox Studio install is required. You do need a Roblox account and an OpenCloud API key with studio-evaluations:create. Roblox currently supports openai, claude, and gemini as OpenGameEval LLM names; OpenRouter models are not exposed through this path until Roblox offers custom provider/base-url support.
Roblox-specific flags:
| Flag | Purpose |
|---|---|
| --roblox-adapter rbx | Adapter for https://apis.roblox.com/open-eval-api/v1; reads OPEN_GAME_EVAL_API_KEY. |
| --roblox-llm-name openai\|claude\|gemini | Provider name sent to Roblox OpenGameEval. |
| --roblox-llm-model-version VERSION | Model version sent to Roblox. Defaults to the selected BLXBench model id without an OpenRouter-style provider prefix. |
| --roblox-max-concurrent N | Max concurrent Roblox jobs per model. Keep low unless Roblox raises your quota. |
| --roblox-poll-interval SECONDS | Poll interval for eval records. |
| --roblox-timeout SECONDS | Per-job timeout. |
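The default for --roblox-llm-model-version drops the provider prefix from the model id. A minimal sketch of that transformation is shown below; the function name is hypothetical, and treating everything up to the first slash as the prefix is an assumption rather than confirmed runner behavior.

```shell
# Drop an OpenRouter-style "provider/" prefix from a model id.
# Assumption: only the text up to the first slash counts as the prefix.
strip_provider_prefix() {
  printf '%s\n' "${1#*/}"
}

strip_provider_prefix "openai/gpt-5.4-mini"   # prints gpt-5.4-mini
```

An id without a slash passes through unchanged, so the helper is safe to apply to values that are already bare model versions.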
Roblox results use category: "roblox" and include eval metadata such as job id, record URL, place id, and check counts. Secrets are not written to reports.
Integration with CI/CD
GitHub Actions
name: Benchmark
on: [push, pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - name: Run benchmark
        run: |
          bun install -g @bitslix/blxbench
          blxbench --headless --provider opr --models openai/gpt-5.4-mini
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: blxbench-results
          # The env context has no HOME entry; upload-artifact expands ~ itself.
          path: ~/.blxbench/reports/
GitLab CI
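stages:
  - benchmark
benchmark:
  stage: benchmark
  image: oven/bun:1
  script:
    - bun install -g @bitslix/blxbench
    - blxbench --headless --provider opr --models openai/gpt-5.4-mini
    # Artifact paths must sit inside the project directory, so copy the reports in.
    - mkdir -p blxbench-reports && cp -r "$HOME/.blxbench/reports/." blxbench-reports/
  artifacts:
    paths:
      - blxbench-reports/
Exit Codes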
| Code | Description |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | Test failure (with --fail-fast) |
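A CI wrapper might translate these codes into log messages. The sketch below maps the values in the table above to their descriptions; the function name describe_exit_code is hypothetical.

```shell
# Map a blxbench exit code to the description from the table above.
describe_exit_code() {
  case "$1" in
    0) echo "Success" ;;
    1) echo "General error" ;;
    2) echo "Invalid arguments" ;;
    3) echo "Test failure (with --fail-fast)" ;;
    *) echo "Undocumented exit code: $1" ;;
  esac
}

# Usage in a CI step (commented out so the sketch runs without blxbench):
#   blxbench --headless --provider opr --models openai/gpt-5.4-mini
#   describe_exit_code "$?"
```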
Rate Limiting
Use --ratelimit to avoid hitting provider rate limits:
# Default (60 RPM)
blxbench --headless --provider opr --models openai/gpt-5.4-mini --ratelimit
# Custom (30 requests per minute)
blxbench --headless --provider opr --models openai/gpt-5.4-mini --ratelimit 30
Output Handling
Save JSON Results
blxbench --headless --provider opr --models openai/gpt-5.4-mini --save-json ./my-results.json
--save-json is an extra export. The regular run folder, HTML report, report.json, screenshots, artifacts, and aggregate ranking files still go under ~/.blxbench/reports/ unless you configure another results directory in the TUI.
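Before archiving or submitting an exported file, a pipeline could run a coarse sanity check like the one below. It assumes the --save-json copy carries the same exit_early field as report.json, which this page does not state explicitly, and it uses grep rather than a JSON parser, so treat it as a sketch. The function name report_is_full_run is hypothetical.

```shell
# Return 0 when the file exists, is non-empty, and records exit_early as false.
# Coarse text match only; it does not validate the JSON structure.
report_is_full_run() {
  file="$1"
  [ -s "$file" ] || return 1
  grep -Eq '"exit_early"[[:space:]]*:[[:space:]]*false' "$file"
}

# Usage (commented out so the sketch runs without a results file):
#   report_is_full_run ./my-results.json || echo "partial or missing run" >&2
```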
Capture Output
# Suppress progress output
blxbench --headless --provider opr --models openai/gpt-5.4-mini 2>/dev/null
# Log to file
blxbench --headless --provider opr --models openai/gpt-5.4-mini >> benchmark.log 2>&1
Automated Submission
Set environment variables for automatic submission:
export BLXBENCH_API_KEY=your-key
export BLXBENCH_SUBMIT=1
blxbench --headless --provider opr --models openai/gpt-5.4-mini
Or use the flag:
blxbench --headless --provider opr --models openai/gpt-5.4-mini --submit --api-key your-key
Non-Interactive Detection
BLXBench automatically detects non-TTY environments and skips the TUI. To force the same behavior in a terminal:
blxbench --headless --provider opr --models openai/gpt-5.4-mini
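A script can make the same distinction itself before deciding whether to pass --headless. The sketch below checks stdin and stdout with the standard test -t operator; which streams blxbench inspects internally is not stated here, so checking both is an assumption, and has_tty is a hypothetical name.

```shell
# Report whether stdin and stdout are both attached to a terminal.
# Assumption: blxbench's own detection may look at different streams.
has_tty() {
  [ -t 0 ] && [ -t 1 ]
}

if has_tty; then
  echo "interactive terminal: pass --headless to skip the TUI"
else
  echo "no TTY: headless path expected without the flag"
fi
```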