BLXBench - Run run

Benchmark run

Started May 10, 2026, 3:49 PM · Recorded May 10, 2026, 4:00 PM · Ended May 10, 2026, 4:00 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 72.8 · Tests: 459 · Models: 1

Passed: 214

Failed: 245

Pass rate: 46.6%

Duration: 678.9s

Categories: 9

Models: 1

Speed avg: 124.8 t/s

Speed TTFT: 1154 ms

Cost/strict: $0.0003

Strict success: 93.3%

Score/$: 292070.16

Failed spend: $0.0008

P50 task cost: $0.0003

P90 task cost: $0.0005

Est. cost (run): $0.47

Tokens (Σ results): 82.4k / 165.2k

Submitted by: Bitslix

Run summary

Generated May 10, 2026, 4:01 PM · qwen/qwen3-235b-a22b-2507

Scope

This benchmark run evaluated a single model, x-ai/grok-4.3, across 459 tests spanning nine categories: coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, and UI. The run was not limited ("limit":null) and did not use fail-fast mode ("fail_fast":false), so every test in the suite was executed or reported. The results are truncated ("results_truncated":true), so this payload does not include every individual test outcome.

Performance

The model achieved an overall pass rate of 214/459 (0.466) and a score percentage of 67.57%. Average latency was 5.35s, with a median of 2.68s. The average time to first token (TTFT) was 1.18s, and the model generated output at an average speed of 101.04 tok/s. The total cost for the run was $0.465, consuming 82,399 prompt tokens and 165,200 completion tokens.
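The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this report (214 passed, 245 failed, 459 tests, $0.465, 82,399 prompt tokens, 165,200 completion tokens). The variable names are illustrative and are not fields from report.json.

```python
# Values copied from the run summary; names are illustrative, not report.json keys.
passed, failed, total_tests = 214, 245, 459
prompt_tokens, completion_tokens = 82_399, 165_200
total_cost_usd = 0.465

# Passed and failed counts should account for every test.
assert passed + failed == total_tests

pass_rate = passed / total_tests                   # fraction of tests passed
total_tokens = prompt_tokens + completion_tokens   # prompt + completion
mean_cost_per_test = total_cost_usd / total_tests  # arithmetic mean, in USD

print(f"pass rate:          {pass_rate:.3f}")
print(f"total tokens:       {total_tokens:,}")
print(f"mean cost per test: ${mean_cost_per_test:.6f}")
```

The computed pass rate rounds to 0.466, matching the 46.6% shown in the header tiles. The mean cost per test is an arithmetic mean over all 459 tests, so it need not equal the P50 task cost reported above.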

Category-level Patterns

High Performance: The model excelled in cost (93.3% pass rate, 90.7% score) and speed (93.3% pass rate, 97.8% score) categories, demonstrating strong efficiency and correctness in performance-critical tasks.
Strong Coding Ability: Coding was a relative strength with an 85% pass rate and 91.2% score, particularly on easy (100% pass) and medium (70% pass) difficulty levels.
Poor Reasoning & Refactoring: Performance was weakest in reasoning (6.7% pass rate) and refactoring (11.7% pass rate), indicating significant difficulty with complex logical or structural code changes.
Security & Hallucination Challenges: Security had a very low pass rate (10%) and score (48.4%), while hallucination scored 70.6% but passed only 40% of tests, showing a tendency to invent incorrect API behaviors.
UI Strength: Despite only 9 tests, the model achieved a perfect 100% pass rate in UI, with a high score percentage of 85.6%.

Notable Failures and Observations

Critical Errors: Two debugging tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with the error "Spread syntax requires ...iterable[Symbol.iterator] to be a function". This is a JavaScript TypeError raised when a non-iterable value is spread into an array or argument list, which suggests the generated code spread a value it assumed was iterable.
High-Cost Tests: The most expensive single test was reason-constraint-subscription-migration in reasoning, costing $0.00523 and taking 40.67s with a very slow output speed of 51.3 tok/s.
Latency Outliers: Several reasoning tests exhibited high latency, such as reason-rc-null-pointer-call (41.34s) and reason-constraint-subscription-migration (40.67s), often correlated with high TTFT (e.g., 33.09s for the former).
Hallucination on Edge Cases: The model failed several hallucination edge cases, such as halluc-edge-integer-overflow and halluc-edge-string-truncate, incorrectly claiming non-existent behaviors.
Difficulty Scaling: Pass rate did not fall uniformly as difficulty increased. Coding declined from 100% (easy) to 70% (medium), but in security the pass rate was 5% (easy), 15% (medium), and 10% (hard), and in refactoring it was 5% (easy), 10% (medium), and 20% (hard), so the easy tier was the weakest in both of those categories.
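The per-difficulty pass rates quoted above for security and refactoring can be reconciled with the category-level pass rates (10% and 11.7%) by averaging across difficulty levels. The sketch below assumes each difficulty level contains the same number of tests; that is an assumption not confirmed by this report, although the reported category figures are consistent with it. The function names are hypothetical.

```python
from statistics import mean

# Pass rates (%) per difficulty level, as quoted in the run summary.
pass_rate_by_difficulty = {
    "security":    {"easy": 5.0, "medium": 15.0, "hard": 10.0},
    "refactoring": {"easy": 5.0, "medium": 10.0, "hard": 20.0},
}

def category_pass_rate(levels: dict[str, float]) -> float:
    """Unweighted mean across levels (assumes equal test counts per level)."""
    return mean(levels.values())

def weakest_level(levels: dict[str, float]) -> str:
    """Difficulty level with the lowest pass rate."""
    return min(levels, key=levels.get)

for category, levels in pass_rate_by_difficulty.items():
    rate = category_pass_rate(levels)
    print(f"{category}: {rate:.1f}% overall, weakest level: {weakest_level(levels)}")
```

Under the equal-count assumption, the means come out to 10.0% for security and 11.7% for refactoring, and the lowest pass rate in both categories falls on the easy tier.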

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
x-ai/grok-4.3 | 459 | 214/459 | 67.6% | 5.35s | $0.47

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Yes — resumed from a paused session

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)