Benchmark run
Started Jun 13, 2026, 9:39 PM · Recorded Jun 13, 2026, 11:24 PM · Ended Jun 13, 2026, 11:24 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated Jun 13, 2026, 11:24 PM · qwen/qwen3-235b-a22b-2507
Model: moonshotai/kimi-k2.7-code
Tests: 459
Categories: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
(… --limit or --category filter applied)
(… fail_fast=false), so all tests were attempted even after failures
Passed: 236/459 (51.4%)
Score: 2072.67 out of 3155 (65.7%)
Timings: 12.63s, 7.58s, 0.91s
Throughput: 93.45 tok/s
Cost: $1.75
Tokens: 30,053; 441,997; 472,050

(… >80% pass rate)
51/60 (85.0% pass rate, 89.7% score): excels in easy and medium tasks.
26/30 (86.7% pass rate, 82.7% score): strong on cost-aware refactoring and fixes.
49/60 (81.7% pass rate, 87.2% score): high performance across difficulty levels.
38/60 (63.3% pass rate, 74.7% score): mixed, but strong on API and edge-case knowledge.

(… <40% pass rate)
5/60 (8.3% pass rate, 37.3% score): extremely poor, especially on medium and hard tasks.
14/60 (23.3% pass rate, 73.6% score): low pass rate despite moderate scoring; struggles with constraint logic.
23/60 (38.3% pass rate, 53.1% score): inconsistent, with many partial or failed diagnoses.
24/60 (40.0% pass rate, 71.1% score): moderate pass rate but decent scoring due to partial credit.
6/9 (66.7% pass rate, 63.0% score): fails both hard-level UI tests.

… 55 out of 60 refactoring tests.
… easy refactoring tasks, pass rate is only 1/20 (5%).
… $0.399) relative to performance.
… 23.3%, but score is 73.6%, which suggests the model receives partial credit for plausible but incorrect reasoning.
… 9/20 passed: some success on common bugs like null checks.
… 6/20 and 8/20), with many 0-score results.
… WeakRef, Intl.Segmenter, structuredClone).
… halluc-doc-node-stream-finished: incorrectly describes Node.js stream lifecycle.
… coding-hard-diff-objects (80.3s).
… coding-hard-diff-objects due to 4,460 total tokens.
… coding-medium-top-k (28.48 tok/s), possibly due to long wait times.
… ui::hard tests (0/2 passed), including one with 11.25s TTFT.
… ui::easy and 5/6 ui::medium tests.

The moonshotai/kimi-k2.7-code model shows strong coding and cost optimization skills, particularly on well-defined programming tasks. However, it struggles severely with refactoring, has inconsistent debugging ability, and fails to reason reliably about system constraints. Its hallucination resistance is moderate, passing many factual API checks but failing behavioral claims. The model is cost-effective for coding, but inefficient for complex reasoning or structural refactoring.
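The headline percentages in the summary can be cross-checked with simple arithmetic. The Python sketch below recomputes them from the counts quoted above; the helper name `pct` and the list layout are illustrative assumptions, not part of blxbench or its report format.

```python
# Passed/total pairs per category, in the order the summary lists them.
# (Illustrative layout; the real report.json schema is not shown in this page.)
CATEGORY_COUNTS = [(51, 60), (26, 30), (49, 60), (38, 60), (5, 60),
                   (14, 60), (23, 60), (24, 60), (6, 9)]


def pct(part: float, whole: float) -> str:
    """Format part/whole as a percentage with one decimal place."""
    return f"{100 * part / whole:.1f}%"


passed = sum(p for p, _ in CATEGORY_COUNTS)
total = sum(t for _, t in CATEGORY_COUNTS)

print(passed, total, pct(passed, total))  # 236 459 51.4%
print(pct(2072.67, 3155))                 # 65.7%
```

The two figures measure different things: the pass rate counts only tests that passed outright, while the score percentage also includes partial credit. That is consistent with the summary's remark that one category has a 23.3% pass rate but a 73.6% score.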
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v2 — Resilience
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
v1.3.4
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
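The Cost column's fallback rule can be sketched in Python as follows. This is a minimal illustration, assuming each row is a dict, that `usage` is a nested dict, and that `cost_usd` takes precedence over `usage.cost`; the function name `task_cost_usd` and the sample rows are hypothetical.

```python
from typing import Any, Optional


def task_cost_usd(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task cost in USD, or None when nothing was recorded."""
    # Assumption: prefer the top-level cost_usd field when it is present.
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    # Assumption: usage is a nested dict that may carry a cost value.
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None


# Hypothetical rows, not taken from this run.
rows = [
    {"id": "a", "cost_usd": 0.012},
    {"id": "b", "usage": {"cost": 0.004}},
    {"id": "c"},
]
print([task_cost_usd(r) for r in rows])  # [0.012, 0.004, None]
```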
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
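The grouping described here can be sketched in Python as below. The field names `category`, `difficulty`, and `id`, and the function name `group_rows`, are assumptions made for illustration; the actual report.json schema is not shown on this page.

```python
from collections import defaultdict


def group_rows(results):
    """Group rows by category, then difficulty, keeping input order in each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in results:
        # Appending in iteration order preserves benchmark execution order.
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


# Hypothetical rows in execution order.
results = [
    {"id": "t1", "category": "ui", "difficulty": "hard"},
    {"id": "t2", "category": "coding", "difficulty": "easy"},
    {"id": "t3", "category": "ui", "difficulty": "hard"},
]
grouped = group_rows(results)
print([r["id"] for r in grouped["ui"]["hard"]])  # ['t1', 't3']
```

Because each row is appended as it is encountered, no sorting is needed to keep the per-table order aligned with the original results list.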