BLXBench - Run

Benchmark run

Started May 10, 2026, 6:05 PM · Recorded May 10, 2026, 6:30 PM · Ended May 10, 2026, 6:30 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 69.6
Tests: 459
Models: 1
Passed: 199
Failed: 260
Pass rate: 43.4%
Duration: 1538.0s
Categories: 9
Speed avg: 114.9 t/s
Speed TTFT: 307 ms
Cost/strict: $0.0000
Strict success: 96.7%
Score/$: 6535382.37
Failed spend: $0.0000
P50 task cost: $0.0000
P90 task cost: $0.0000
Est. cost (run): $0.01
Tokens (Σ results): 30.7k / 129.2k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 6:30 PM · qwen/qwen3-235b-a22b-2507

Scope

The benchmark run evaluated a single model, ibm-granite/granite-4.1-8b. All 459 tests across 9 categories were executed, with no filtering by test level or category. The run had no test limit ("limit": null) and did not use fail-fast mode ("fail_fast": false). The results are truncated ("results_truncated": true), so this payload does not include every test outcome.

Performance

The model passed 199 of 459 tests (43.36%) and earned 67.07% of the maximum possible score. Latency averaged 2.89s, with a median of 2.27s. Time-to-first-token (TTFT) averaged 0.284s, and output speed averaged 109.79 tok/s. The run cost $0.0143 in total and used 30,723 prompt tokens and 129,180 completion tokens.
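The headline figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers quoted in this summary. The variable names are illustrative, and pooling prompt and completion tokens into one per-token cost is a simplifying assumption, since the two are usually priced separately.

```python
# Figures copied from the run summary above.
passed, total = 199, 459
prompt_tokens, completion_tokens = 30_723, 129_180
total_cost_usd = 0.0143

failed = total - passed
pass_rate = 100 * passed / total
total_tokens = prompt_tokens + completion_tokens

# Assumption: prompt and completion tokens are pooled into one blended rate.
cost_per_1k_tokens = total_cost_usd / total_tokens * 1000
mean_cost_per_test = total_cost_usd / total

print(f"failed tests:        {failed}")
print(f"pass rate:           {pass_rate:.2f}%")
print(f"total tokens:        {total_tokens}")
print(f"cost per 1k tokens:  ${cost_per_1k_tokens:.6f}")
print(f"mean cost per test:  ${mean_cost_per_test:.6f}")
```

Under these inputs the mean cost per test is about $0.00003, which rounds to $0.0000 at four decimal places. That is consistent with the per-task cost cards showing $0.0000 even though the run total is non-zero.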

Category-Level Patterns

Speed: Strong performance with a pass rate of 91.67% (55/60) and score percentage of 97.22%. All hard tests were passed perfectly.
Cost: Excellent results, passing 29/30 tests (96.67%) with a score percentage of 90.67%. All hard and medium tests were fully passed.
Hallucination: Moderate performance at 56.67% pass rate (34/60) and 73.33% score. The model struggled with edge cases and API claims but performed well on behavior and bug detection.
Coding: Pass rate of 55% (33/60) and score percentage 64.75%. Performance declined with difficulty: 80% on easy, 50% on medium, and 35% on hard.
Reasoning: Very low pass rate of 23.33% (14/60) despite a higher score percentage of 73.38%, indicating partial credit on many tasks. Hard and medium tests were particularly challenging.
Refactoring: Poor pass rate of 15% (9/60) and 63.40% score. Performance was consistent across difficulty levels, with no level exceeding 20% pass rate.
Security: Lowest pass rate among the categories listed here, with only 6/60 passed (10%) and a 44.14% score. Notably, all 20 easy tests failed despite moderate scores, suggesting output that earned partial credit but was still incorrect.
UI: Only 1/9 passed (11.11%), with the lowest score percentage listed (21.93%). The model struggled across all difficulty levels.
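The list above covers eight categories, while the run reports nine. The sketch below re-derives each listed pass rate from its pass/total counts and subtracts the listed totals from the run totals (199/459) to see what is left for any unlisted category. It assumes the run totals and the per-category counts are both accurate. Attributing the remainder to the debugging category, which appears only under the notable failures, is a further assumption.

```python
# (passed, total) per category, copied from the list above.
listed = {
    "speed": (55, 60),
    "cost": (29, 30),
    "hallucination": (34, 60),
    "coding": (33, 60),
    "reasoning": (14, 60),
    "refactoring": (9, 60),
    "security": (6, 60),
    "ui": (1, 9),
}
RUN_PASSED, RUN_TOTAL = 199, 459  # run-level totals from the summary

# Recompute each pass rate from its counts.
rates = {name: round(100 * p / t, 2) for name, (p, t) in listed.items()}

listed_passed = sum(p for p, _ in listed.values())
listed_total = sum(t for _, t in listed.values())

# Whatever the run totals leave over belongs to categories not listed above.
unlisted_passed = RUN_PASSED - listed_passed
unlisted_total = RUN_TOTAL - listed_total

for name, rate in rates.items():
    print(f"{name:<14}{rate:>6.2f}%")
print(f"unlisted: {unlisted_passed}/{unlisted_total} passed")
```

With these inputs the listed categories account for 181 passes out of 399 tests, leaving 18 passes out of 60 tests unaccounted for in the list.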

Notable Failures and Observations

Two tests in the debugging category (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) resulted in errors: "Spread syntax requires ...iterable[Symbol.iterator] to be a function". This is the JavaScript TypeError raised when spread syntax is applied to a non-iterable value, which suggests the code failed at runtime rather than merely producing a wrong answer.
The reasoning category included a test (reason-constraint-sla-breach) with a latency of 21.7s and an output speed of 18.95 tok/s, against run averages of 2.89s and 109.79 tok/s, suggesting potential looping or inefficiency.
TTFT varied widely on UI tests (0.267s on easy vs 0.935s on medium), indicating inconsistent responsiveness on more complex prompts even though the model did well in the cost and speed categories.
The security category's 0% pass rate on easy tests is a critical concern, as the model failed basic security checks despite generating plausible outputs (non-zero scores).
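To put the outliers above in proportion, the sketch below expresses them as ratios against the run-level averages quoted in the Performance section. It uses only figures from this summary, and the variable names are illustrative.

```python
# Run-level averages from the Performance section.
avg_latency_s = 2.89
avg_speed_tok_s = 109.79

# Outlier test reason-constraint-sla-breach.
outlier_latency_s = 21.7
outlier_speed_tok_s = 18.95

# UI TTFT by difficulty.
ui_ttft_easy_s = 0.267
ui_ttft_medium_s = 0.935

latency_ratio = outlier_latency_s / avg_latency_s       # how much slower overall
slowdown_ratio = avg_speed_tok_s / outlier_speed_tok_s  # how much slower decoding
ttft_ratio = ui_ttft_medium_s / ui_ttft_easy_s          # medium vs easy TTFT

print(f"latency vs average:   {latency_ratio:.1f}x")
print(f"decode slowdown:      {slowdown_ratio:.1f}x")
print(f"UI TTFT medium/easy:  {ttft_ratio:.1f}x")
```

The ratios describe only how far these tests sit from the averages. They do not identify the cause, which could be prompt length, provider load, or model behavior.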

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                        Tests   Pass      Score   Latency   Cost
ibm-granite/granite-4.1-8b   459     199/459   67.1%   2.89s     $0.01

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows come from report.json → results, grouped by category and then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge shows the verdict and overall score (0–100) from judge_validation / validation_model for coding and UI tests. The table contains no HTML or screenshots. Cost is the per-task USD from cost_usd or usage.cost when recorded. Suite is the same manifest version/hash for every row in this run.

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)