Benchmark run
Started Apr 24, 2026, 8:47 PM · Ended Apr 24, 2026, 8:48 PM · Recorded Apr 24, 2026, 11:06 PM
Generated Apr 24, 2026, 8:49 PM · qwen/qwen3-235b-a22b-2507
Model: inclusionai/ling-2.6-1t:free · 7 tests · 7 categories
Categories: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Note: --limit 1 was used, but all 7 tests executed, suggesting the limit may not have been applied per category, or that the suite naturally contains only one test per category.
Rate limit: 7 rpm · "fail_fast": false
Passed: 6/7 tests (85.7% pass rate)
Score: 5.71 out of 7 (81.57%)
Cost: $0.00 (all completions free)
Latency: 0.97s · 1.21s · 2.34s
Output speed: 106.73 tok/s overall · fastest 148.27 tok/s (refactoring) · slowest 20.86 tok/s (reasoning)

Per-category results

coding_ui: 1/1 · score 0.71/1 (71%), partial credit likely due to suboptimal UI/UX or missing features in the generated analog-clock HTML/CSS · 15.16s (highest in the suite) · 133.42 tok/s
debugging: 0/1 · failed debugging_easy_01_fix_greater_than: the model incorrectly changed age >= 18 to age >= 21, introducing a bug · 1.19s, TTFT 1.05s, output speed 116.66 tok/s
hallucination: 1/1 · 0.89s · output speed 94.88 tok/s
reasoning: 1/1 · output {"result":42} · 20.86 tok/s, which may indicate internal deliberation or structured-generation overhead
refactoring: 1/1 · 148.27 tok/s
security: 1/1
speed: 1/1 · 0.82s

Notes

coding_ui: The 15.16s completion time for generating an analog clock UI may reflect longer response length (2163 total tokens) rather than inefficiency.

Summary: inclusionai/ling-2.6-1t:free demonstrates solid overall performance, with 6/7 passes and strong scores in security, refactoring, and reasoning. The only failure occurred in a basic debugging task, which may warrant further investigation. Latency and output speed are generally favorable, though generation speed varies significantly by task type.
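The headline numbers above are internally consistent: five categories pass with full credit, coding_ui passes with 0.71 partial credit, and debugging scores 0. A quick sketch, assuming the suite simply sums per-test scores (0 to 1) and treats any nonzero score as a pass:

```python
# Reconstructing the run's headline metrics from the per-test scores.
# Assumption: total score is the sum of per-test scores, and any score > 0
# counts as a pass; the report itself does not spell out the formula.
per_test = {
    "coding_ui": 0.71,   # passed with partial credit
    "debugging": 0.0,    # failed
    "hallucination": 1.0,
    "reasoning": 1.0,
    "refactoring": 1.0,
    "security": 1.0,
    "speed": 1.0,
}

total_score = sum(per_test.values())                                     # 5.71
score_pct = total_score / len(per_test) * 100                            # 81.57%
pass_rate = sum(v > 0 for v in per_test.values()) / len(per_test) * 100  # 85.7%

print(f"{total_score:.2f} out of {len(per_test)} ({score_pct:.2f}%)")
print(f"pass rate: {pass_rate:.1f}%")
```

This reproduces both the 5.71/7 (81.57%) score and the 85.7% pass rate, which explains why the two percentages differ: the partial-credit coding_ui test counts fully toward the pass rate but only 0.71 toward the score.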
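The debugging failure is worth a closer look. The actual task fixture is not included in this report, but the test name (debugging_easy_01_fix_greater_than) and the failure note suggest the bug was a strict comparison that should have been inclusive, and the model instead moved the threshold. A hypothetical reconstruction of the failure mode:

```python
# Hypothetical reconstruction; the real task source is not in the report.
# The test name hints at a strict '>' that should have been '>='.

def is_adult_buggy(age):
    return age > 18       # presumed original bug: excludes exactly-18-year-olds

def is_adult_expected(age):
    return age >= 18      # presumed intended fix: inclusive comparison

def is_adult_model(age):
    return age >= 21      # what the model produced: a new, different bug

# age 18 exposes both the original bug and the model's change:
print(is_adult_buggy(18), is_adult_expected(18), is_adult_model(18))
```

Under this reading, the model's edit fails for every age from 18 to 20, which is likely why the test awarded 0/1 rather than partial credit.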
Per-model aggregates from overall_ranking.json for this run id.
No matching report.json under results/ — charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
Discovery
Limited — up to 1 test per category
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
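The chart described above pairs responsiveness with throughput. The exact normalization blxbench applies is not documented in this report; a common approach, sketched below under that assumption, is min-max scaling TTFT across categories and inverting it so the most responsive category scores 1.0:

```python
# Hedged sketch: min-max normalize TTFT across categories, then invert so
# lower TTFT maps to a higher score. The sample values are illustrative
# placeholders, not figures from this run (only debugging's 1.05s TTFT
# appears in the report).
def inverted_normalized_ttft(ttft_by_category):
    lo = min(ttft_by_category.values())
    hi = max(ttft_by_category.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all TTFTs are equal
    return {cat: 1.0 - (t - lo) / span for cat, t in ttft_by_category.items()}

sample = {"debugging": 1.05, "speed": 0.70, "coding_ui": 1.40}
print(inverted_normalized_ttft(sample))
```

Plotting this inverted score against decode tok/s puts "fast to start" and "fast to generate" on the same upward-is-better footing.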
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table.
7 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)