BLXBench - Run run

Benchmark run

Started Jun 17, 2026, 11:21 PM · Recorded Jun 18, 2026, 1:09 AM · Ended Jun 18, 2026, 1:09 AM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 66.4
Tests: 459
Models: 1

Passed: 199
Failed: 260
Pass rate: 43.4%
Duration: 6534.0s
Categories: 9
Models: 1
Speed avg: 102.7 t/s
Speed TTFT: 655 ms
Cost/strict: $0.00
Strict success: 93.3%
Score/$: n/a
Failed spend: $0.00
P50 task cost: $0.00
P90 task cost: $0.00
Est. cost (run): $0.00
Tokens (Σ results): 25.9k / 190.6k
Submitted by: Bitslix

Run summary

Generated Jun 18, 2026, 1:10 AM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: cohere/north-mini-code:free
Total tests: 459
Categories included: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full category coverage with level breakdowns (easy/medium/hard), no test limiting ("limit": null)
Fail-fast behavior: Disabled ("fail_fast": false)
Cost tracking: Enabled, but total cost was $0.00 (free tier)
Token usage: 25,949 prompt tokens, 190,641 completion tokens
Results status: Partially truncated ("results_truncated": true), but summary aggregates are complete.

Performance Overview

Pass rate: 199/459 (43.36%)
Score: 2090.63 / 3155 (66.26%)
Latency:
- Average: 5.59s
- Median: 2.44s
- High TTFT (Time to First Token): 1.16s average
Output speed: 210.89 tok/s average
No errors or API costs reported.
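
The two headline percentages above follow directly from the raw counts. A minimal sketch of that arithmetic, assuming two-decimal rounding; the helper name `percent` is illustrative and not part of blxbench:

```python
def percent(part: float, whole: float) -> float:
    """Return part / whole as a percentage, rounded to two decimals."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return round(100.0 * part / whole, 2)


# Figures copied from the Performance Overview.
pass_rate = percent(199, 459)        # tests passed / total tests
score_pct = percent(2090.63, 3155)   # points earned / points available

print(f"Pass rate: {pass_rate}%")
print(f"Score: {score_pct}%")
```

The gap between the two numbers reflects partial credit: a test can fail the pass threshold and still contribute points to the score.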

Category-Level Performance

High-Performing Categories

Cost: 28/30 passed (93.33%), score 132/150 (88%)
- Strong across all levels, including 10/10 on hard tests.
Speed: 51/60 passed (85%), score 164/180 (91.11%)
- Fastest average output speed in easy/medium tiers.
Debugging: 31/60 passed (51.67%), score 438/601 (72.88%)
- Solid mid-tier performance; best in easy (13/20 passed).
Hallucination: 32/60 passed (53.33%), score 266/360 (73.89%)
- Strong on hard API/edge cases (e.g., fetch-timeout, stream-pipeline passed).

Low-Performing Categories

Reasoning: 11/60 passed (18.33%), score 360/541 (66.54%)
- Very high TTFT (4.49s avg), suggesting slow reasoning or prompt processing.
- Poor performance on constraint-based logic (e.g., batch-window, rollout-window failed).
Refactoring: 16/60 passed (26.67%), score 374/541 (69.13%)
- Slowest output speed: 74.22 tok/s avg.
Security: 11/60 passed (18.33%), score 279/512 (54.49%)
- Weak on hard tests: only 2/20 passed.
Coding: 14/60 passed (23.33%), score 73/261 (27.97%)
- Struggles with basic utilities (e.g., chunk-array, is-palindrome, title-case failed).
- Only 3/20 hard coding tasks passed.

UI

UI: 5/9 passed (55.56%)
- Failed both hard tests (ui::hard): score 0/2.
- Only one medium test failed; easy test passed.
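
The per-category pass counts listed above can be cross-checked against the run totals. A small sketch, with the (passed, total) pairs copied from this report; the dictionary layout is an illustrative choice, not the structure of report.json:

```python
# (passed, total) per category, copied from the category breakdown.
categories = {
    "cost": (28, 30),
    "speed": (51, 60),
    "debugging": (31, 60),
    "hallucination": (32, 60),
    "reasoning": (11, 60),
    "refactoring": (16, 60),
    "security": (11, 60),
    "coding": (14, 60),
    "ui": (5, 9),
}

total_passed = sum(passed for passed, _ in categories.values())
total_tests = sum(total for _, total in categories.values())


def pass_rate(name: str) -> float:
    """Pass rate of one category as a fraction between 0 and 1."""
    passed, total = categories[name]
    return passed / total


# Categories ordered from highest to lowest pass rate.
ranked = sorted(categories, key=pass_rate, reverse=True)

print(total_passed, total_tests)
for name in ranked:
    print(f"{name}: {pass_rate(name):.2%}")
```

The sums reproduce the 199/459 reported for the run as a whole.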

Notable Observations

Latency disparity: The model shows a wide latency range. The median is 2.44s but the average is 5.59s, which indicates long-tail outliers (e.g., a debugging test at 7.29s).
High TTFT in reasoning: 4.49s average TTFT in reasoning suggests significant prompt processing delay, likely due to complex context or model inefficiency.
Strong on factual/hallucination resistance: Performed well on API documentation and edge-case claims, correctly avoiding hallucinated behaviors.
Inconsistent medium/hard coding: Passed complex tasks like expression-evaluator and consist-hash, but failed simpler ones like chunk-array and truncate-string.
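
The latency observation rests on a general property: a few slow requests pull the mean well above the median. The sketch below illustrates this with made-up latencies. The values are hypothetical, chosen only to show the effect, and are not taken from this run:

```python
from statistics import mean, median

# HYPOTHETICAL per-test latencies in seconds (not data from this run):
# five typical requests and one long-tail outlier.
latencies = [1.8, 2.1, 2.4, 2.5, 2.6, 22.0]

avg = mean(latencies)
med = median(latencies)

print(f"mean={avg:.2f}s median={med:.2f}s")

# A mean far above the median is the signature of a right-skewed distribution.
skewed = avg > 1.5 * med
print("long tail suspected:", skewed)
```

The 1.5 factor is an arbitrary threshold for the illustration; percentiles such as P90 or P99 describe the tail more directly when per-test timings are available.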

Summary

The cohere/north-mini-code:free model demonstrates strong performance in cost optimization, speed, and hallucination resistance, but struggles significantly with coding fundamentals, reasoning under constraints, and security tasks. It excels at avoiding false claims and generating efficient fixes but lacks consistency in algorithmic problem-solving. Latency is acceptable for most tasks but degrades notably in reasoning-heavy scenarios.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
cohere/north-mini-code:free | 459 | 199/459 | 66.3% | 5.59s | $0.00

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
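
The grouping and cost rules described for this table can be sketched as follows. The field names `cost_usd` and `usage.cost` come from the description above; the row keys `category`, `level`, and `name`, and the sample rows themselves, are assumptions for illustration and may not match the real report.json schema:

```python
from collections import defaultdict
from typing import Any, Optional


def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Per-task USD: prefer cost_usd, fall back to usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


def group_results(results: list[dict[str, Any]]) -> dict[str, dict[str, list]]:
    """Group rows by category, then difficulty, keeping execution order."""
    grouped: dict[str, dict[str, list]] = defaultdict(lambda: defaultdict(list))
    for row in results:  # appending in list order preserves execution order
        grouped[row["category"]][row["level"]].append(row)
    return grouped


# HYPOTHETICAL rows; the keys "category", "level", "name" are assumed.
sample = [
    {"category": "coding", "level": "easy", "name": "a", "cost_usd": 0.0},
    {"category": "ui", "level": "hard", "name": "b", "usage": {"cost": 0.002}},
    {"category": "coding", "level": "easy", "name": "c"},
]

grouped = group_results(sample)
coding_easy = [row["name"] for row in grouped["coding"]["easy"]]
costs = [task_cost(row) for row in sample]
print(coding_easy, costs)
```

Returning `None` when neither field is present keeps "no cost recorded" distinct from a recorded cost of $0.00, which matters for a free-tier run like this one.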