BLXBench - Run run

Benchmark run

Started May 10, 2026, 1:29 PM · Recorded May 10, 2026, 2:48 PM · Ended May 10, 2026, 2:48 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 75.4 · Tests: 459 · Models: 1

Passed: 226
Failed: 233
Pass rate: 49.2%
Duration: 4750.1s
Categories: 9
Speed avg: 61.0 t/s
Speed TTFT: 714ms
Cost/strict: $0.0000
Strict success: 96.7%
Score/$: 2296042.62
Failed spend: $0.0001
P50 task cost: $0.0000
P90 task cost: $0.0001
Est. cost (run): $0.06
Tokens (Σ results): 28.9k / 194.1k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 2:48 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: deepseek/deepseek-v4-flash
Total tests: 459
Categories: All categories were included (no category filter applied)
Run mode: Full run (not limited or partial), though results are truncated in output
Fail-fast mode: Disabled (fail_fast=false)

Performance Summary

Pass rate: 226/459 (49.2%)
Score: 2234.12 out of 3137 (71.2% of max)
Total cost: $0.0577
Latency:
- Average: 9.98s
- Median: 5.88s
- Average TTFT (Time to First Token): 1.25s
Output speed: 52.44 tok/s average
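
The two headline percentages can be re-derived from the raw counts reported above. The following is a minimal sketch of that arithmetic; the helper name `percent` is hypothetical and is not part of blxbench.

```python
# Figures copied from the Performance Summary above.
passed, total_tests = 226, 459
score, max_score = 2234.12, 3137


def percent(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pass_rate = percent(passed, total_tests)  # share of tests that passed
score_pct = percent(score, max_score)     # share of the maximum score earned

print(f"Pass rate: {pass_rate}%")
print(f"Score: {score_pct}% of max")
```

Both values agree with the reported 49.2% pass rate and 71.2% of max score.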

Category-Level Performance

High-Performing Categories

Speed: 55/60 passed (91.7%), 96.1% score — fastest TTFT (0.71s) and high output speed (60.96 tok/s)
Cost: 29/30 passed (96.7%), 87.3% score — excellent accuracy and low cost
Coding: 48/60 passed (80%), 92.3% score — strong in easy and hard levels, weakest in medium (65% pass)
UI: 6/9 passed (66.7%), 79.1% score — limited data, but moderate performance

Low-Performing Categories

Refactoring: 9/60 passed (15%), 61.0% score — very low pass rate despite decent score percentage
Security: 11/60 passed (18.3%), 58.2% score — poor pass rate across all levels
Reasoning: 14/60 passed (23.3%), 69.1% score — struggles with constraint and runtime correctness logic
Debugging: 25/60 passed (41.7%), 71.9% score — inconsistent, with some complex failures (e.g., prototype pollution parsing error)

Hallucination

Pass rate: 29/60 (48.3%), 72.5% score
Mixed results: passes on real APIs (e.g., fetch-timeout, node-crypto) but fails on common misconceptions (e.g., array-flat, promise-resolve)
Notable failure: halluc-edge-string-truncate took 90.73s with very low output speed (1.16 tok/s)
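
As a consistency check, the per-category pass counts listed above can be summed and compared with the run totals. This is a small sketch using only the figures quoted in this summary; the dictionary layout is an illustrative choice, not the structure of report.json.

```python
# (passed, total) per category, copied from the category lists above.
categories = {
    "speed": (55, 60),
    "cost": (29, 30),
    "coding": (48, 60),
    "ui": (6, 9),
    "refactoring": (9, 60),
    "security": (11, 60),
    "reasoning": (14, 60),
    "debugging": (25, 60),
    "hallucination": (29, 60),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(n for _, n in categories.values())

# Category names ordered from lowest to highest pass rate.
ranked = sorted(categories, key=lambda name: categories[name][0] / categories[name][1])

print(f"{total_passed}/{total_tests} passed across {len(categories)} categories")
print("Lowest pass rate:", ranked[0], "| highest:", ranked[-1])
```

The sums reproduce the run-level 226/459, and the ordering matches the high-performing and low-performing groupings in the summary.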

Notable Observations

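Failures due to runtime errors:
- Two debugging tests (prototype-pollution-check-v2, prototype-pollution-merge-v2) failed with error: Spread syntax requires ...iterable[Symbol.iterator] to be a function. This message is a runtime TypeError raised when spread is applied to a value that is not iterable, which suggests the generated code spread a non-iterable value rather than containing invalid JS syntax.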
High-latency outliers:
- debugging::deep-clone-v2: 68.95s latency (TTFT: 38.64s)
- halluc-edge-string-truncate: 90.73s latency (TTFT: 0.92s, but output speed only 1.16 tok/s)
Cost-efficient categories: the cost category had the lowest total cost ($0.0011) despite its high pass rate
High-cost categories: refactoring ($0.0136) and reasoning ($0.0085) due to longer outputs and retries
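
The two latency outliers look different once the time is split into waiting for the first token and decoding. The sketch below assumes that the reported output speed is measured over the decode window (latency minus TTFT); the report does not state how tok/s is computed, so the token estimate is approximate. Both function names are hypothetical.

```python
def ttft_share(latency_s: float, ttft_s: float) -> float:
    """Fraction of total latency spent waiting for the first token."""
    return ttft_s / latency_s


def estimated_output_tokens(latency_s: float, ttft_s: float, tok_per_s: float) -> float:
    """Rough output-token count, ASSUMING tok/s covers only the decode window."""
    return (latency_s - ttft_s) * tok_per_s


# debugging::deep-clone-v2: 68.95s latency, 38.64s TTFT
deep_clone_share = ttft_share(68.95, 38.64)

# halluc-edge-string-truncate: 90.73s latency, 0.92s TTFT, 1.16 tok/s
truncate_tokens = estimated_output_tokens(90.73, 0.92, 1.16)

print(f"deep-clone-v2 TTFT share: {deep_clone_share:.0%}")
print(f"string-truncate estimated output: ~{truncate_tokens:.0f} tokens")
```

Under that assumption, more than half of the deep-clone-v2 latency was spent before the first token, while the string-truncate test produced only roughly 104 tokens, which points to slow decoding rather than an unusually long answer.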

Conclusion

The deepseek/deepseek-v4-flash model performs well in coding, cost, and speed tasks, showing fast response times and high accuracy. It struggles significantly in refactoring, security, and reasoning, indicating weaknesses in deeper program analysis and logical deduction. Hallucination resistance is moderate, with frequent false claims about API behavior. Latency is generally acceptable, though a few pathological cases (e.g., string truncation, deep clone) show severe performance degradation.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
deepseek/deepseek-v4-flash | 459 | 226/459 | 71.2% | 9.98s | $0.06

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

Not recorded (older report.json)

Resumed run

Category performance (this run)

[Chart] Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
[Chart] By difficulty level
[Chart] Pass vs fail
[Chart] Score & success by cluster: avg score % (bars) and strict success rate % (line) per cost cluster.
[Chart] Latency distribution: per-test latency (seconds), successful timings only.
[Chart] Streaming shape (radar): normalized TTFT (inverted) vs decode tok/s per category for this run.
[Chart] Metrics breakdown (avg %): average score % per metric dimension across all v2 tasks in this run.
[Chart] Tests & cost by category: tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows come from the results field of report.json, grouped by category and then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge shows the verdict and overall score (0–100) from judge_validation / validation_model for coding and UI tasks. The table contains no HTML or screenshots. Cost is the per-task USD amount from cost_usd or usage.cost when recorded. Suite is the same manifest version/hash for every row in this run.

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
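
The cost rule described above (use cost_usd, otherwise usage.cost, when recorded) can be sketched as a small helper. Only the field names cost_usd and usage.cost come from this page; the overall row layout, the category key, and both function names are assumptions for illustration, and the sample rows are made up rather than taken from this run.

```python
from collections import defaultdict
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD cost: cost_usd first, then usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None  # cost was not recorded for this task


def cost_by_category(rows: list[dict[str, Any]]) -> dict[str, float]:
    """Sum recorded costs per category, skipping rows without a cost."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        cost = task_cost(row)
        if cost is not None:
            # "category" is a hypothetical key name for the row's category.
            totals[row.get("category", "unknown")] += cost
    return dict(totals)


# Made-up rows for illustration only; not data from this run.
sample_rows = [
    {"category": "cost", "cost_usd": 0.0001},
    {"category": "cost", "usage": {"cost": 0.0002}},
    {"category": "coding"},  # no cost recorded
]

totals = cost_by_category(sample_rows)
print(totals)
```

Returning None instead of 0.0 for unrecorded costs keeps "free" tasks distinguishable from tasks with missing cost data when totals are aggregated.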