BLXBench - Run run

Benchmark run

Started May 10, 2026, 8:05 PM · Recorded May 10, 2026, 8:54 PM · Ended May 10, 2026, 8:54 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 78.3 · Tests: 459 · Models: 1

Passed: 278

Failed: 181

Pass rate: 60.6%

Duration: 2942.6s

Categories: 9

Models: 1

Speed avg: 115.8 t/s

Speed TTFT: 722ms

Cost/strict: $0.0006

Strict success: 93.3%

Score/$: 145431.81

Failed spend: $0.0020

P50 task cost: $0.0006

P90 task cost: $0.0010

Est. cost (run): $1.10

Tokens (Σ results): 30.3k / 271.7k

Submitted by: Bitslix

Run summary

Generated May 10, 2026, 8:54 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: moonshotai/kimi-k2.6
Total tests: 459
Categories covered: All available categories were tested (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --category filter); fail_fast was disabled
Truncation note: Results are truncated (results_truncated: true), meaning not all individual test outcomes are included in this payload.

Performance Summary

Overall pass rate: 278/459 (60.57%)
Score: 2346.92 out of 3137 (74.81%)
Total cost: $1.10
Average latency: 5.82s
Median latency: 3.31s
Average time to first token (TTFT): 0.43s
Average output speed: 130.21 tok/s
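The pass-rate and score percentages above follow directly from the raw counts. The short Python sketch below recomputes them as a sanity check; the variable names and the `pct` helper are illustrative only and are not part of blxbench.

```python
# Figures copied from the Performance Summary above.
passed, total = 278, 459
score, max_score = 2346.92, 3137


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


failed = total - passed              # tests that did not pass
pass_rate = pct(passed, total)       # overall pass rate in percent
score_pct = pct(score, max_score)    # overall score in percent

print(f"failed={failed} pass_rate={pass_rate}% score={score_pct}%")
```

Run as written, this reproduces the 181 failures, the 60.57% pass rate, and the 74.81% score reported above.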

Category-Level Performance

High-Performing Categories

Coding: 54/60 passed (90% pass rate), score 249/261 (95.4%). Strong across all difficulty levels, especially easy (100% score).
Cost: 28/30 passed (93.33%), score 132/150 (88%). Excellent performance, particularly on medium difficulty (100% pass rate).
Speed: 55/60 passed (91.67%), score 169/180 (93.89%). High consistency, with easy and hard levels both scoring ≥95%.

Moderate Performers

Security: 30/60 passed (50%), score 359/512 (70.12%). Pass rate stays roughly flat across difficulty (easy: 45% pass, hard: 50%).
Hallucination: 33/60 passed (55%), score 277/360 (76.94%). Mixed results; better on hard (60% pass) than easy (40%).
Debugging: 38/60 passed (63.33%), score 466/583 (79.93%). Solid mid-tier performance, improves with difficulty (hard: 75% pass).

Low-Performing Categories

Reasoning: 23/60 passed (38.33%), score 367/541 (67.84%). Struggles across all levels, especially easy (30% pass).
Refactoring: 9/60 passed (15%), score 321/541 (59.33%). The weakest category in the benchmark; the easy-level pass rate is only 5%.
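The per-category figures can be cross-checked against the run totals. The Python sketch below recomputes each category's percentages and sums the eight categories that the summary breaks out. The data structure and names are illustrative. Attributing the remainder to the ui category, which is listed in the scope but has no breakdown here, is an inference and not something the report states.

```python
# (passed, tests, score, max_score) per category, copied from the summary above.
categories = {
    "coding":        (54, 60, 249, 261),
    "cost":          (28, 30, 132, 150),
    "speed":         (55, 60, 169, 180),
    "security":      (30, 60, 359, 512),
    "hallucination": (33, 60, 277, 360),
    "debugging":     (38, 60, 466, 583),
    "reasoning":     (23, 60, 367, 541),
    "refactoring":   (9, 60, 321, 541),
}

# Run-level totals from the Performance Summary.
run_total = {"passed": 278, "tests": 459, "score": 2346.92, "max_score": 3137}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


for name, (passed, tests, score, max_score) in categories.items():
    print(f"{name:<14} pass {pct(passed, tests):6.2f}%  score {pct(score, max_score):6.2f}%")

# Sum the listed categories field by field.
fields = ("passed", "tests", "score", "max_score")
listed = {
    field: sum(row[i] for row in categories.values())
    for i, field in enumerate(fields)
}

# Whatever the run totals contain beyond the listed categories.
remainder = {field: round(run_total[field] - listed[field], 2) for field in fields}

print("listed categories:", listed)
print("not broken out:   ", remainder)
```

The eight listed categories account for 450 tests, 270 passes, and 2340 of 3128 points, leaving 9 tests, 8 passes, and 6.92 of 9 points not broken out in the summary.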

Notable Observations

High-cost tests: Of the category costs cited here, reasoning ($0.213) and refactoring ($0.207) were the highest, ahead of debugging ($0.106), despite the low pass rates in reasoning and refactoring.
Latency outliers:
- The coding-medium-sliding-window-max test had a very high latency of 18.4s, though it passed.
- Several debugging and reasoning tests exceeded 6s latency.
Output speed variation:
- Fastest output speed in coding (148.8 tok/s avg), slowest in refactoring (94.4 tok/s).
- The cost-complex-retry-with-backoff test achieved a peak speed of 187.3 tok/s.
Failures with errors: Two debugging tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) failed with the JavaScript error "Spread syntax requires ...iterable[Symbol.iterator] to be a function". This TypeError is raised when spread syntax is applied to a non-iterable value, which points to a likely flaw in the generated code.

Conclusion

The moonshotai/kimi-k2.6 model performs strongly in coding, cost, and speed tasks (pass rates of 90% or higher), demonstrating reliable code generation and efficiency. It struggles with reasoning (38.33% pass) and especially refactoring (15% pass), suggesting limitations in logical inference and structural code transformation. Hallucination resistance is moderate (55% pass) rather than robust. The comparatively high spend in reasoning and refactoring, both low-pass-rate categories, suggests verbose or inefficient outputs on complex tasks.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost

moonshotai/kimi-k2.6 | 459 | 278/459 | 74.8% | 5.82s | $1.10

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)