BLXBench - Run run

Benchmark run

Started Jun 13, 2026, 9:39 PM · Recorded Jun 13, 2026, 11:24 PM · Ended Jun 13, 2026, 11:24 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 70.3 · Tests: 459 · Models: 1

Passed: 236

Failed: 223

Pass rate: 51.4%

Duration: 6256.1s

Categories: 9

Models: 1

Speed avg: 77.5 t/s

Speed TTFT: 807ms

Cost/strict: $0.0021

Strict success: 86.7%

Score/$: 45518.07

Failed spend: $0.01

P50 task cost: $0.0013

P90 task cost: $0.0040

Est. cost (run): $1.75

Tokens (Σ results): 30.1k / 442.0k

Submitted by: Bitslix

Run summary

Generated Jun 13, 2026, 11:24 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: moonshotai/kimi-k2.7-code
Total tests run: 459
Categories covered: 9 — coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui
Run mode: Full benchmark (no --limit or --category filter applied)
Fail-fast behavior: Disabled (fail_fast=false), so all tests were attempted even after failures

Performance Summary

Overall

Pass rate: 236/459 (51.4%)
Score: 2072.67 out of 3155 (65.7%)
Latency:
- Average: 12.63s
- Median: 7.58s
Time to first token (TTFT): Average 0.91s
Output speed: Average 93.45 tok/s
Total cost: $1.75
Tokens used:
- Prompt: 30,053
- Completion: 441,997
- Total: 472,050
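
The headline percentages and the token total above can be re-derived from the raw counts. A minimal sketch in Python, using only figures quoted in this summary; the variable names are illustrative and are not fields from the benchmark's output:

```python
# Figures copied from the Overall summary; variable names are illustrative only.
passed, total_tests = 236, 459
score, max_score = 2072.67, 3155
prompt_tokens, completion_tokens = 30_053, 441_997

# Percentages rounded to one decimal place, as in the report.
pass_rate = round(100 * passed / total_tests, 1)
score_pct = round(100 * score / max_score, 1)
total_tokens = prompt_tokens + completion_tokens

print(f"Pass rate: {pass_rate}%")
print(f"Score: {score_pct}%")
print(f"Total tokens: {total_tokens:,}")
```

The three values agree with the reported 51.4%, 65.7%, and 472,050.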

Category-Level Performance

Strong Performers (>80% pass rate)

Coding: 51/60 (85.0% pass rate, 89.7% score). Excels in easy and medium tasks.
Cost: 26/30 (86.7% pass rate, 82.7% score). Strong on cost-aware refactoring and fixes.
Speed: 49/60 (81.7% pass rate, 87.2% score). High performance across difficulty levels.

Weak Performers (<40% pass rate)

Refactoring: 5/60 (8.3% pass rate, 37.3% score). Extremely poor at every difficulty level, including easy tasks.
Reasoning: 14/60 (23.3% pass rate, 73.6% score). Low pass rate despite moderate scoring; struggles with constraint logic.
Debugging: 23/60 (38.3% pass rate, 53.1% score). Inconsistent, with many partial or failed diagnoses.

Moderate Performers

Hallucination: 38/60 (63.3% pass rate, 74.7% score). Mixed, but strong on API and edge-case knowledge.
Security: 24/60 (40.0% pass rate, 71.1% score). Moderate pass rate but decent scoring due to partial credit.
UI: 6/9 (66.7% pass rate, 63.0% score). Fails both hard-level UI tests.
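
The grouping by pass rate can be checked mechanically. The sketch below assumes the two thresholds stated in the headings (above 80% is strong, below 40% is weak) and treats everything in between as moderate. The counts are copied from the category lines in this report; the `bucket` helper is hypothetical and is not part of blxbench.

```python
# (passed, total) per category, copied from the category lines in this report.
results = {
    "coding": (51, 60), "cost": (26, 30), "speed": (49, 60),
    "hallucination": (38, 60), "refactoring": (5, 60), "reasoning": (14, 60),
    "debugging": (23, 60), "security": (24, 60), "ui": (6, 9),
}

def bucket(passed: int, total: int) -> str:
    """Classify by pass rate: >80% strong, <40% weak, otherwise moderate."""
    rate = 100 * passed / total
    if rate > 80:
        return "strong"
    if rate < 40:
        return "weak"
    return "moderate"

buckets = {name: bucket(p, t) for name, (p, t) in results.items()}

for name, (p, t) in results.items():
    print(f"{name:14s} {p:>2}/{t:<2} {100 * p / t:5.1f}%  {buckets[name]}")

# The per-category counts should add up to the run totals (236 of 459).
print(sum(p for p, _ in results.values()), sum(t for _, t in results.values()))
```

Under these thresholds, security sits exactly on the 40% boundary and therefore counts as moderate rather than weak.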

Notable Observations

Refactoring is a Critical Weakness

The model fails 55 out of 60 refactoring tests.
Even on easy refactoring tasks, pass rate is only 1/20 (5%).
Cost is high in this category ($0.399) relative to performance.
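
To put the refactoring spend in proportion, the cost per passed test can be compared with the run-wide figure. This is a rough sketch using only numbers quoted in this report ($0.399 and 5 passes for refactoring, $1.75 and 236 passes overall); the variable names are illustrative.

```python
# Figures quoted in this report; names are illustrative only.
refactoring_cost, refactoring_passed = 0.399, 5
run_cost, run_passed = 1.75, 236

cost_per_pass_refactoring = refactoring_cost / refactoring_passed
cost_per_pass_run = run_cost / run_passed
share_of_run_cost = 100 * refactoring_cost / run_cost

print(f"Refactoring cost per pass: ${cost_per_pass_refactoring:.4f}")
print(f"Run-wide cost per pass:    ${cost_per_pass_run:.4f}")
print(f"Refactoring share of run cost: {share_of_run_cost:.1f}%")
```

By this measure, refactoring consumes roughly 23% of the run's spend while holding 60 of the 459 tests (about 13%), and each passed refactoring test costs about ten times the run-wide average.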

Reasoning: High Score Despite Low Pass Rate

Pass rate is only 23.3%, but score is 73.6%, which suggests the model receives partial credit for plausible but incorrect reasoning.
Struggles with constraint-based logic (e.g., rate limits, retention policies, rollout windows).

Debugging: Inconsistent Across Levels

Easy: 9/20 passed — some success on common bugs like null checks.
Medium/Hard: Very low pass rates (6/20 and 8/20), with many 0-score results.
Notable failures in concurrency, race conditions, and deep clone bugs.
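
The per-level debugging counts can be turned into pass rates for easier comparison. A small sketch using the counts listed above; the dictionary layout is illustrative, not the benchmark's own data format.

```python
# (passed, total) per difficulty level for debugging, as listed above.
debugging = {"easy": (9, 20), "medium": (6, 20), "hard": (8, 20)}

rates = {level: 100 * p / t for level, (p, t) in debugging.items()}
total_passed = sum(p for p, _ in debugging.values())
total_tests = sum(t for _, t in debugging.values())

for level, rate in rates.items():
    print(f"{level:6s} {rate:.1f}%")
print(f"overall {total_passed}/{total_tests}")
```

The levels sum to the 23/60 reported for the category. Note that the hard-level rate is higher than the medium-level rate, so difficulty alone does not explain the failures.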

Hallucination: Mixed Results

Passes several hard API and edge-case knowledge checks (e.g., WeakRef, Intl.Segmenter, structuredClone).
Fails on claims about deduplication, sorting, and nested merge behavior.
Fails halluc-doc-node-stream-finished — incorrectly describes Node.js stream lifecycle.

Latency and Cost

Highest latency test: coding-hard-diff-objects (80.3s).
Most expensive test: Likely coding-hard-diff-objects due to 4,460 total tokens.
Lowest output speed: coding-medium-top-k (28.48 tok/s), possibly due to long wait times.

UI Failures

Fails both ui::hard tests (0/2 passed), including one with 11.25s TTFT.
Only passes ui::easy and 5/6 ui::medium tests.

Conclusion

The moonshotai/kimi-k2.7-code model shows strong coding and cost optimization skills, particularly on well-defined programming tasks. However, it struggles severely with refactoring, has inconsistent debugging ability, and fails to reason reliably about system constraints. Its hallucination resistance is moderate, passing many factual API checks but failing behavioral claims. The model is cost-effective for coding, but inefficient for complex reasoning or structural refactoring.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
moonshotai/kimi-k2.7-code | 459 | 236/459 | 65.7% | 12.63s | $1.75

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.4

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
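
The Cost column described above draws on either cost_usd or usage.cost. The sketch below shows one way such a fallback could be written. It assumes each row is a dictionary, that cost_usd takes precedence when both fields are present, and that usage is a nested object; none of these details are confirmed by the report. The `task_cost` helper and the sample rows are hypothetical.

```python
from typing import Any, Optional

def task_cost(row: dict[str, Any]) -> Optional[float]:
    """Return per-task USD from cost_usd, else usage.cost, else None.

    Assumption: cost_usd wins when both fields are present.
    """
    if row.get("cost_usd") is not None:
        return float(row["cost_usd"])
    usage = row.get("usage") or {}
    if usage.get("cost") is not None:
        return float(usage["cost"])
    return None

# Hypothetical rows; ids and values are invented for illustration,
# not taken from this run's report.json.
rows = [
    {"id": "example-a", "cost_usd": 0.0013},
    {"id": "example-b", "usage": {"cost": 0.0040}},
    {"id": "example-c"},  # no cost recorded
]

costs = [task_cost(r) for r in rows]
print(costs)
```

Returning None rather than 0.0 for rows without a recorded cost keeps missing data from being mistaken for free tasks when costs are later aggregated.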