BLXBench - Run run

Benchmark run

Started May 10, 2026, 7:53 PM · Recorded May 10, 2026, 8:09 PM · Ended May 10, 2026, 8:09 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 69.4 · Tests: 459 · Models: 1

Passed: 201

Failed: 258

Pass rate: 43.8%

Duration: 969.5s

Categories: 9

Models: 1

Speed avg: 390.9 t/s

Speed TTFT: 687 ms

Cost/strict: $0.0002

Strict success: 96.7%

Score/$: 516570.62

Failed spend: $0.0002

P50 task cost: $0.0001

P90 task cost: $0.0003

Est. cost (run): $0.23

Tokens (Σ results): 27.3k / 150.1k

Submitted by: Bitslix

Run summary

Generated May 10, 2026, 8:09 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: google/gemini-3.1-flash-lite
Total tests run: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Run mode: Full benchmark (no --limit or --fail-fast)
Results completeness: Partial results shown ("results_truncated": true), but summary aggregates are complete.

Performance Summary

Overall pass rate: 201/459 (43.79%)
Overall score: 2015.13 / 3137 (64.24%)
Total cost: $0.2297
Average latency: 1.72s
Median latency: 1.33s
Average time to first token (TTFT): 0.81s
Average output speed: 504.70 tok/s
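
The headline percentages above follow directly from the raw counts. A minimal sketch of that arithmetic, assuming plain ratios rounded to two decimals (the helper name pct is illustrative and not part of blxbench):

```python
# Raw counts copied from the Performance Summary above.
passed, total_tests = 201, 459
score, max_score = 2015.13, 3137


def pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


pass_rate = pct(passed, total_tests)   # overall pass rate in percent
score_rate = pct(score, max_score)     # overall score in percent
failed = total_tests - passed          # tests that did not pass

print(f"pass rate {pass_rate}%, score {score_rate}%, failed {failed}")
```

The gap between the two percentages is consistent with partial credit: failed tests can still contribute to the score.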

Category-Level Performance

High Performers

Speed: 55/60 passed (91.67%), score 175/180 (97.22%)
Cost: 29/30 passed (96.67%), score 132/150 (88.00%)
Coding: 50/60 passed (83.33%), score 246/261 (94.25%)

Low Performers

Refactoring: 0/60 passed (0.00%), score 237/541 (43.81%)
Reasoning: 5/60 passed (8.33%), score 270/541 (49.91%)
UI: 4/9 passed (44.44%), score 3.13/9 (34.77%)
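
The six categories listed above can be ranked by pass rate, and subtracting them from the run totals shows what is left for the three categories not listed here (debugging, hallucination, security). This is a sketch using only the figures in this summary; the dictionary layout is an assumption for illustration, not the structure of report.json.

```python
# (passed, total) per category, copied from the High/Low Performers lists.
categories = {
    "speed": (55, 60),
    "cost": (29, 30),
    "coding": (50, 60),
    "refactoring": (0, 60),
    "reasoning": (5, 60),
    "ui": (4, 9),
}

# Rank categories from highest to lowest pass rate.
ranked = sorted(
    categories.items(),
    key=lambda item: item[1][0] / item[1][1],
    reverse=True,
)
for name, (p, t) in ranked:
    print(f"{name:12s} {p:>2}/{t:<2} {100 * p / t:6.2f}%")

# Run totals from the Performance Summary: 201 passed out of 459 tests.
listed_passed = sum(p for p, _ in categories.values())
listed_total = sum(t for _, t in categories.values())
remaining_passed = 201 - listed_passed   # passes in the unlisted categories
remaining_total = 459 - listed_total     # tests in the unlisted categories
print(f"unlisted categories: {remaining_passed}/{remaining_total}")
```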

Notable Observations

Strong Category Performance

Coding excels across all difficulty levels, with coding::easy achieving 95% pass rate and 98.88% score.
Cost shows near-perfect performance on easy and medium levels (100% and 90% pass rates respectively).
Speed achieves a 100% pass rate on the hard level (speed::hard); since the category passes 55/60 overall, its five failures fall in the other levels.

Weak Category Performance

Refactoring fails all tests (0% pass rate) across all difficulty levels, despite moderate scoring due to partial credit.
Reasoning struggles severely, with only 2/20 passed on easy and 1/20 on medium. The hardest reasoning tasks (reasoning::hard) see only 2 passes.
Debugging has a 40% pass rate overall, with debugging::hard dropping to 25% pass rate.

Hallucination Detection

Hallucination category pass rate: 21/60 (35%)
The model frequently makes incorrect API claims (e.g., halluc-api-array-flat, halluc-api-generator-return).
It performs better on edge cases, passing halluc-edge-float-precision and halluc-bug-typeof-null.

Latency & Efficiency

Fastest categories (lowest TTFT): cost, debugging, hallucination, and security, all averaging about 0.65s TTFT.
Slowest category: reasoning (1.44s TTFT), likely due to complex multi-step analysis.
Highest output speed: cost (1374.58 tok/s), likely due to concise, factual responses.
Lowest output speed: debugging (300.69 tok/s), possibly due to verbose diagnostic reasoning.

Cost Analysis

Most expensive category: reasoning ($0.0763), due to long outputs and high token counts (50,918 completion tokens).
Cheapest category: cost ($0.0051), reflecting efficient, short responses.
Total completion tokens: 150,132 (vs 27,309 prompt tokens), indicating verbose model outputs.
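
The cost figures above can be put in proportion with a few ratios. The sketch below assumes simple division of the reported totals; the variable names are illustrative, and the per-test mean is derived here rather than read from the report.

```python
# Figures copied from the Cost Analysis and Performance Summary above.
total_cost = 0.2297              # USD for the whole run
total_tests = 459
prompt_tokens = 27_309
completion_tokens = 150_132
reasoning_cost = 0.0763          # USD, most expensive category
reasoning_completion = 50_918    # completion tokens in reasoning

# Mean spend per test (derived, not a reported metric).
mean_cost_per_test = total_cost / total_tests

# How many completion tokens the model produced per prompt token.
output_ratio = completion_tokens / prompt_tokens

# Reasoning's share of total spend and of all completion tokens.
reasoning_cost_share = reasoning_cost / total_cost
reasoning_token_share = reasoning_completion / completion_tokens

print(f"mean cost per test: ${mean_cost_per_test:.4f}")
print(f"completion/prompt ratio: {output_ratio:.2f}")
print(f"reasoning share of cost: {100 * reasoning_cost_share:.1f}%")
print(f"reasoning share of completion tokens: {100 * reasoning_token_share:.1f}%")
```

Reasoning accounts for roughly a third of both spend and completion tokens, which supports the point that output length drives cost in this run.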

Failures of Note

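debugging category: Two tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with the runtime error "Spread syntax requires ...iterable[Symbol.iterator] to be a function", which JavaScript raises when a non-iterable value is spread into an array literal or argument list. This points to a bug in the generated code rather than a wrong answer.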
reasoning category: Nearly all failures involve misapplying constraints (e.g., time windows, policy merges, rate limits), indicating weak logical consistency.
refactoring category: The universal failure suggests the model lacks understanding of code structure transformation or intent preservation.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost

google/gemini-3.1-flash-lite | 459 | 201/459 | 64.2% | 1.72s | $0.23

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)