BLXBench - Run

Benchmark run

Started May 19, 2026, 6:29 PM · Recorded May 19, 2026, 7:24 PM · Ended May 19, 2026, 7:24 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 44.8
Tests: 459
Models: 1
Passed: 108
Failed: 351
Pass rate: 23.5%
Duration: 3257.9 s
Categories: 9
Speed avg: 329.8 t/s
Speed TTFT: 2112 ms
Cost/strict: $0.02
Strict success: 43.3%
Score/$: 8001.38
Failed spend: $0.13
P50 task cost: $0.0069
P90 task cost: $0.0090
Est. cost (run): $5.43
Tokens (Σ results): 27.4k / 605.0k
Submitted by: Bitslix

Run summary

Generated May 19, 2026, 7:24 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: google/gemini-3.5-flash
Total tests run: 459
Categories included: All available categories were tested (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --category filter applied)
Fail-fast behavior: Disabled (fail_fast=false)
Results completeness: Partial — results_truncated=true, meaning not all test outcomes are included in this payload.
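
A consumer of this payload may want to check the truncation flag before trusting any per-test aggregates. The sketch below is illustrative only: the key names fail_fast, results_truncated and results appear in this report, but where they sit in the actual JSON is an assumption, and the inline fragment and the helper completeness_note are hypothetical.

```python
import json

# Hypothetical payload fragment. The flat layout is an assumption;
# only the key names themselves are mentioned in the report.
report_text = '{"fail_fast": false, "results_truncated": true, "results": []}'
report = json.loads(report_text)


def completeness_note(report: dict) -> str:
    """Describe whether per-test results can be treated as complete."""
    if report.get("results_truncated", False):
        return "partial: per-test results were truncated"
    return "complete"


print(completeness_note(report))
```

Treating a missing flag as "complete" is a design choice of this sketch; a stricter consumer could treat a missing flag as unknown instead.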

Performance Summary

Overall pass rate: 108/459 (23.5%)
Total score: 1308.69 out of 3155 (41.5%)
Total cost: $5.43
Average latency: 6.66s
Median latency: 4.94s
Average time to first token (TTFT): 2.25s
Average output speed: 338.25 tok/s
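
The two headline percentages can be rechecked from the raw counts listed above. This is a minimal sketch that only redoes the arithmetic; the variable names are illustrative.

```python
# Figures copied from the Performance Summary above.
passed, total_tests = 108, 459
score, max_score = 1308.69, 3155

pass_rate = 100 * passed / total_tests   # share of tests that passed
score_pct = 100 * score / max_score      # share of available points earned

print(f"pass rate: {pass_rate:.1f}%")
print(f"score:     {score_pct:.1f}%")
```

Both values round to the percentages reported above (23.5% and 41.5%).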

Category-Level Performance

✅ Strong Performance

Coding:
- Pass rate: 51/60 (85%)
- Score: 225/261 (86.2%)
- Latency: 2.28s TTFT, 323.1 tok/s output speed
- Subcategory highlights:
  - coding::easy: Perfect pass rate (20/20)
  - coding::medium: 85% pass rate
  - coding::hard: 70% pass rate

⚠️ Moderate Performance

Security:
- Pass rate: 13/60 (21.7%)
- Score: 236/512 (46.1%)
- Strongest in security::medium (40% pass rate)
Reasoning:
- Pass rate: 8/60 (13.3%)
- Score: 273/541 (50.5%) — high score despite low pass rate, suggesting partial credit on complex tasks
Speed:
- Pass rate: 14/60 (23.3%)
- Score: 83/180 (46.1%) — better on harder tasks (35% pass rate on speed::hard)

❌ Weak Performance

Debugging:
- Pass rate: 0/60 (0%)
- Score: 155/601 (25.8%); every test failed, but the nonzero score indicates partial credit
Hallucination:
- Pass rate: 5/60 (8.3%)
- Score: 115/360 (31.9%) — frequent false claims about APIs and behaviors
Refactoring:
- Pass rate: 2/60 (3.3%)
- Score: 135/541 (24.9%) — very low success rate across all difficulty levels
Cost:
- Pass rate: 13/30 (43.3%), the second-highest pass rate in this run after coding; results are mixed, with only a 20% pass rate on cost::hard

⚠️ Limited Data

UI: Only 9 tests; pass rate 2/9 (22.2%), with every test in the easy and hard subcategories failing.
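
As a consistency check, the per-category pass counts above can be summed and ranked. The sketch below uses only the (passed, tests) pairs stated in this section; the dictionary layout is illustrative.

```python
# (passed, tests) per category, copied from the category breakdown above.
categories = {
    "coding": (51, 60),
    "security": (13, 60),
    "reasoning": (8, 60),
    "speed": (14, 60),
    "debugging": (0, 60),
    "hallucination": (5, 60),
    "refactoring": (2, 60),
    "cost": (13, 30),
    "ui": (2, 9),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(n for _, n in categories.values())

# Rank categories by pass rate, highest first.
ranked = sorted(categories.items(),
                key=lambda kv: kv[1][0] / kv[1][1],
                reverse=True)

for name, (p, n) in ranked:
    print(f"{name:<14}{p:>3}/{n:<3} {100 * p / n:5.1f}%")
print(f"total: {total_passed}/{total_tests}")
```

The totals match the overall 108/459 reported for the run.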

Notable Observations

High hallucination rate: Model frequently invents non-existent API behaviors (for example in the halluc-api-* and halluc-doc-* tests), especially at the hard and medium levels.
Strong on coding, weak on reasoning about systems: Excels at code generation but struggles with debugging, system constraints, and cost implications.
Cost inefficiency in refactoring: Refactoring consumed 121k completion tokens yet passed only 2 of 60 tests (3.3%).
Latency variability: Average latency (6.66s) exceeds the median (4.94s), which indicates a long tail of slow tests, likely complex or failing tasks.
TTFT consistency: Time to first token is stable across categories (~2.2–2.4s), suggesting prompt processing is predictable.
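
The latency observation rests on the mean sitting above the median. The sketch below illustrates that effect with a hypothetical sample; the latencies are invented for illustration and are not taken from this run.

```python
from statistics import mean, median

# Hypothetical per-test latencies in seconds (NOT data from this run).
# One slow outlier stands in for a long-tail task.
latencies = [3.1, 4.2, 4.6, 4.9, 5.0, 5.3, 6.0, 20.0]

avg = mean(latencies)
med = median(latencies)

# Dropping the single slowest sample shows how much it inflates the mean.
avg_without_tail = mean(sorted(latencies)[:-1])

print(f"mean={avg:.2f}s median={med:.2f}s mean_without_tail={avg_without_tail:.2f}s")
```

A single slow test is enough to lift the mean well above the median, which is why the median is the more robust figure for typical per-test latency.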

Summary

The google/gemini-3.5-flash model demonstrates strong coding ability but significant weaknesses in debugging, refactoring, and hallucination avoidance. It earns moderate scores on reasoning and security (roughly 46-51%) despite low pass rates in both. The run cost $5.43 across 459 tests, and because results_truncated=true the outcome data in this payload is incomplete. The model is suitable for straightforward code generation but unreliable for system reasoning or safety-critical tasks.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                    Tests  Pass     Score  Latency  Cost
google/gemini-3.5-flash  459    108/459  41.5%  6.66s    $5.43

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
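
The grouping described above (by category, then by difficulty, keeping execution order inside each table) can be sketched as follows. The row fields id, category and difficulty and the sample task ids are hypothetical; the actual report.json row schema is not shown on this page.

```python
from collections import defaultdict

# Hypothetical per-test rows in benchmark execution order.
rows = [
    {"id": "task-1", "category": "coding", "difficulty": "easy"},
    {"id": "task-2", "category": "security", "difficulty": "hard"},
    {"id": "task-3", "category": "coding", "difficulty": "easy"},
    {"id": "task-4", "category": "coding", "difficulty": "hard"},
]

# category -> difficulty -> list of task ids.
# Appending while iterating keeps the original execution order per table.
grouped = defaultdict(lambda: defaultdict(list))
for row in rows:
    grouped[row["category"]][row["difficulty"]].append(row["id"])

for category, by_difficulty in grouped.items():
    for difficulty, ids in by_difficulty.items():
        print(f"{category}::{difficulty}: {ids}")
```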