Benchmark run
Started Jun 17, 2026, 11:21 PM · Recorded Jun 18, 2026, 1:09 AM · Ended Jun 18, 2026, 1:09 AM
Test suite v2 — Resilience · 045d4510abd0…
Generated Jun 18, 2026, 1:10 AM · qwen/qwen3-235b-a22b-2507
cohere/north-mini-code:free
459 tasks in 9 categories (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
"limit": null
"fail_fast": false
$0.00 (free tier)
25,949 prompt tokens, 190,641 completion tokens
"results_truncated": true, but summary aggregates are complete.
199/459 passed (43.36%)
Score 2090.63 / 3155 (66.26%)
5.59s · 2.44s · 1.16s average · 210.89 tok/s average
28/30 passed (93.33%), score 132/150 (88%)
… 10/10 on hard tests.
51/60 passed (85%), score 164/180 (91.11%)
31/60 passed (51.67%), score 438/601 (72.88%)
… easy (13/20 passed).
32/60 passed (53.33%), score 266/360 (73.89%)
… fetch-timeout, stream-pipeline passed).
11/60 passed (18.33%), score 360/541 (66.54%)
… 4.49s avg), suggesting slow reasoning or prompt processing.
… batch-window, rollout-window failed).
16/60 passed (26.67%), score 374/541 (69.13%)
… 74.22 tok/s avg.
11/60 passed (18.33%), score 279/512 (54.49%)
… 2/20 passed.
14/60 passed (23.33%), score 73/261 (27.97%)
… chunk-array, is-palindrome, title-case failed).
… 3/20 hard coding tasks passed.
5/9 passed (55.56%)
… ui::hard): score 0/2.
… 2.44s, average is 5.59s, indicating long-tail outliers (e.g., debugging test with 7.29s).
4.49s average TTFT in reasoning suggests significant prompt processing delay, likely due to complex context or model inefficiency.
… expression-evaluator and consist-hash, but failed simpler ones like chunk-array and truncate-string.
The cohere/north-mini-code:free model demonstrates strong performance in cost optimization, speed, and hallucination resistance, but struggles significantly with coding fundamentals, reasoning under constraints, and security tasks. It excels at avoiding false claims and generating efficient fixes but lacks consistency in algorithmic problem-solving. Latency is acceptable for most tasks but degrades notably in reasoning-heavy scenarios.
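The overall pass rate can be cross-checked against the nine per-category pass counts quoted in the summary. The sketch below is only an arithmetic check on figures already in this report; the variable names are illustrative, and category names are left out because the summary text does not pair them reliably with the counts.

```python
# (passed, total) per category, in the order the summary lists them.
# Category names are intentionally omitted: the summary does not label these counts.
category_results = [
    (28, 30), (51, 60), (31, 60), (32, 60), (11, 60),
    (16, 60), (11, 60), (14, 60), (5, 9),
]

passed = sum(p for p, _ in category_results)
total = sum(t for _, t in category_results)

# Percentages rounded to two decimals, as in the report.
pass_rate = round(100 * passed / total, 2)
score_pct = round(100 * 2090.63 / 3155, 2)

print(f"{passed}/{total} passed ({pass_rate}%)")
print(f"score 2090.63 / 3155 ({score_pct}%)")
```

Both results agree with the aggregates in the summary (199/459 at 43.36% and 66.26%), which is consistent with the note that the summary aggregates are complete even though the per-test results are truncated.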
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.4
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
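The summary sets a 5.59s average against a 2.44s figure and attributes the gap to long-tail outliers. Assuming the 2.44s value is a median (its label is not preserved here), the sketch below shows how a few slow calls pull the mean well above the median. The samples are hypothetical and do not come from report.json; the success flag mirrors the "successful timings only" filter.

```python
from statistics import mean, median

# Hypothetical (latency_seconds, succeeded) samples; none come from report.json.
samples = [
    (1.8, True), (2.1, True), (2.3, True), (2.4, True),
    (2.6, True), (2.9, True), (7.3, True), (23.0, True),
    (30.0, False),  # failed call: excluded, mirroring "successful timings only"
]

# Keep only timings from successful calls.
successful = [latency for latency, ok in samples if ok]

avg_latency = mean(successful)
median_latency = median(successful)

print(f"mean={avg_latency:.2f}s median={median_latency:.2f}s")
```

With these made-up numbers, two slow calls are enough to push the mean to more than twice the median, the same pattern the summary describes.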
Normalized TTFT (inverted) vs decode tok/s per category for this run.
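The report does not say how TTFT is normalized or inverted. One plausible reading, offered only as an assumption, is a min-max normalization flipped so that the fastest category scores 1.0 and the slowest 0.0. In the sketch below, `inverted_min_max` is a hypothetical helper; the 4.49s reasoning TTFT is from the summary and the other values are invented.

```python
def inverted_min_max(values):
    """Map each value to [0, 1] so that the smallest input scores 1.0.

    Assumed normalization; the benchmark's actual formula is not documented here.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All categories equally fast: give each the top score.
        return {name: 1.0 for name in values}
    return {name: (hi - v) / (hi - lo) for name, v in values.items()}


# TTFT in seconds per category. Only "reasoning" (4.49) comes from the summary;
# the other two values are hypothetical.
ttft_seconds = {"reasoning": 4.49, "coding": 1.20, "ui": 0.80}

ttft_score = inverted_min_max(ttft_seconds)
print(ttft_score)
```

Inverting puts TTFT on the same "higher is better" footing as decode tok/s, so both can share one chart.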
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
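The grouping described here (by category, then by difficulty, keeping execution order within each group) can be sketched as follows. The row fields `id`, `category`, and `difficulty` and the sample rows are hypothetical; the actual schema of report.json results is not shown in this report.

```python
from collections import defaultdict


def group_rows(rows):
    """Group test ids by category, then difficulty, preserving input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:  # rows are assumed to arrive in benchmark execution order
        grouped[row["category"]][row["difficulty"]].append(row["id"])
    # Convert nested defaultdicts to plain dicts for display and comparison.
    return {cat: dict(by_diff) for cat, by_diff in grouped.items()}


# Hypothetical rows standing in for report.json results.
rows = [
    {"id": "t1", "category": "coding", "difficulty": "easy"},
    {"id": "t2", "category": "coding", "difficulty": "hard"},
    {"id": "t3", "category": "debugging", "difficulty": "easy"},
    {"id": "t4", "category": "coding", "difficulty": "easy"},
]

tables = group_rows(rows)
print(tables)
```

Appending to lists while iterating once keeps each group in execution order without a separate sort step.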