Benchmark run
Started May 10, 2026, 8:05 PM · Recorded May 10, 2026, 8:54 PM · Ended May 10, 2026, 8:54 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 8:54 PM · qwen/qwen3-235b-a22b-2507
Model: moonshotai/kimi-k2.6
Tests: 459 across 9 categories (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Full suite (no --limit or --category filter); fail_fast was disabled
Results were truncated (results_truncated: true), meaning not all individual test outcomes are included in this payload.

Passed: 278/459 (60.57%)
Score: 2346.92 out of 3137 (74.81%)
Cost: $1.10
Timings: 5.82s · 3.31s · 0.43s
Throughput: 130.21 tok/s

54/60 passed (90% pass rate), score 249/261 (95.4%). Strong across all difficulty levels, especially easy (100% score).
28/30 passed (93.33%), score 132/150 (88%). Excellent performance, particularly on medium difficulty (100% pass rate).
55/60 passed (91.67%), score 169/180 (93.89%). High consistency, with easy and hard levels both scoring ≥95%.
30/60 passed (50%), score 359/512 (70.12%). Performance declines with difficulty (easy: 45% pass, hard: 50%).
33/60 passed (55%), score 277/360 (76.94%). Mixed results; better on hard (60% pass) than easy (40%).
38/60 passed (63.33%), score 466/583 (79.93%). Solid mid-tier performance, improves with difficulty (hard: 75% pass).
23/60 passed (38.33%), score 367/541 (67.84%). Struggles across all levels, especially easy (30% pass).
9/60 passed (15%), score 321/541 (59.33%). Very weak performance, worst in the benchmark. Easy level pass rate is only 5%.

The debugging category incurred the highest cost ($0.106), followed by reasoning ($0.213) and refactoring ($0.207), despite low pass rates in the latter two.
The coding-medium-sliding-window-max test had a very high latency of 18.4s, though it passed.
Debugging and reasoning tests exceeded 6s latency.
Fastest in coding (148.8 tok/s avg), slowest in refactoring (94.4 tok/s).
The cost-complex-retry-with-backoff test achieved a peak speed of 187.3 tok/s.
Two debugging tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with the error: Spread syntax requires ...iterable[Symbol.iterator] to be a function, indicating a possible model hallucination or code generation flaw.

The moonshotai/kimi-k2.6 model performs strongly in coding, cost, and speed tasks, demonstrating reliable code generation and efficiency. However, it struggles significantly with reasoning and especially refactoring tasks, suggesting limitations in structural code transformation and logical inference. Hallucination resistance is moderate, but not robust. The high cost in low-pass-rate categories indicates inefficient or verbose outputs under complexity.
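The percentages in the summary can be rechecked with simple arithmetic. The sketch below copies the eight passed/tests/score/max-score rows and the overall totals from the summary; the names rows, overall, pct, and remainder are illustrative and not part of blxbench. Assuming the summary figures are accurate, the eight rows cover 450 of the 459 tests, and the remainder is what is left for a ninth row that the summary does not list.

```python
# Figures copied from the summary: (passed, tests, score, max_score) per listed row.
rows = [
    (54, 60, 249, 261),
    (28, 30, 132, 150),
    (55, 60, 169, 180),
    (30, 60, 359, 512),
    (33, 60, 277, 360),
    (38, 60, 466, 583),
    (23, 60, 367, 541),
    (9, 60, 321, 541),
]

# Overall totals from the summary, in the same column order.
overall = (278, 459, 2346.92, 3137)


def pct(part, whole):
    """Percentage rounded to two decimals, as displayed in the report."""
    return round(100 * part / whole, 2)


def remainder(rows, overall):
    """Overall totals minus the column sums of the listed rows."""
    sums = [sum(col) for col in zip(*rows)]
    return tuple(round(total - s, 2) for total, s in zip(overall, sums))


if __name__ == "__main__":
    print("pass rate %:", pct(overall[0], overall[1]))
    print("score %:", pct(overall[2], overall[3]))
    print("unlisted (passed, tests, score, max_score):", remainder(rows, overall))
```

Under these assumptions the remainder works out to 8 passed of 9 tests with a score of 6.92 out of 9, which is consistent with nine categories of which only eight appear in the per-row breakdown.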
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
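The cost fallback described above (cost_usd, else usage.cost, when recorded) could be sketched as follows. This is a minimal illustration, not blxbench's actual code: the function name task_cost_usd is hypothetical, the row is assumed to be a plain dict parsed from report.json, and the precedence of cost_usd over usage.cost is an assumption based on the order in which the two fields are named.

```python
from typing import Any, Optional


def task_cost_usd(row: dict[str, Any]) -> Optional[float]:
    """Return the per-task cost in USD, or None when no cost was recorded.

    Assumption: cost_usd takes precedence over usage.cost.
    """
    cost = row.get("cost_usd")
    if cost is None:
        # Fall back to the nested usage.cost field; usage may be missing or null.
        usage = row.get("usage") or {}
        cost = usage.get("cost")
    return None if cost is None else float(cost)
```

Returning None rather than 0.0 keeps "not recorded" distinct from a task that genuinely cost nothing, so averages are not skewed by missing values.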
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)