Benchmark run
Started Apr 24, 2026, 8:47 PM · Ended Apr 24, 2026, 8:48 PM · Recorded Apr 24, 2026, 11:06 PM
Generated Apr 24, 2026, 8:49 PM · qwen/qwen3-235b-a22b-2507
Model: inclusionai/ling-2.6-1t:free · 7 tests · 7 categories
Categories: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Note: --limit 1 was used, but all 7 tests executed, suggesting the limit may not have been applied per category, or that the suite naturally contains only one test per category.
Rate limit: 7 rpm · "fail_fast": false
Passed: 6/7 tests (85.7% pass rate)
Score: 5.71 out of 7 (81.57%)
Cost: $0.00 (all completions free)
Latency: 0.97s · 1.21s · 2.34s
Output speed: 106.73 tok/s overall · fastest 148.27 tok/s (refactoring) · slowest 20.86 tok/s (reasoning)

Per-category results

coding_ui: 1/1 · score 0.71/1 (71%), partial credit likely due to suboptimal UI/UX or missing features in the generated analog-clock HTML/CSS · 15.16s (highest in the suite) · 133.42 tok/s
debugging: 0/1 · failed debugging_easy_01_fix_greater_than: the model incorrectly changed age >= 18 to age >= 21, introducing a bug · 1.19s, TTFT 1.05s, output speed 116.66 tok/s
hallucination: 1/1 · 0.89s · output speed 94.88 tok/s
reasoning: 1/1 · output {"result":42} · 20.86 tok/s, which may indicate internal deliberation or structured-generation overhead
refactoring: 1/1 · 148.27 tok/s
security: 1/1
speed: 1/1 · 0.82s

Notes

coding_ui: The 15.16s completion time for generating an analog clock UI may reflect longer response length (2163 total tokens) rather than inefficiency.

Summary: inclusionai/ling-2.6-1t:free demonstrates solid overall performance, with 6/7 passes and strong scores in security, refactoring, and reasoning. The only failure occurred in a basic debugging task, which may warrant further investigation. Latency and output speed are generally favorable, though generation speed varies significantly by task type.
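The headline numbers above are internally consistent: five categories pass with full credit, coding_ui passes with 0.71 partial credit, and debugging scores 0. A quick sketch, assuming the suite simply sums per-test scores (0 to 1) and treats any nonzero score as a pass:

```python
# Reconstructing the run's headline metrics from the per-test scores.
# Assumption: total score is the sum of per-test scores, and any score > 0
# counts as a pass; the report itself does not spell out the formula.
per_test = {
    "coding_ui": 0.71,   # passed with partial credit
    "debugging": 0.0,    # failed
    "hallucination": 1.0,
    "reasoning": 1.0,
    "refactoring": 1.0,
    "security": 1.0,
    "speed": 1.0,
}

total_score = sum(per_test.values())                                     # 5.71
score_pct = total_score / len(per_test) * 100                            # 81.57%
pass_rate = sum(v > 0 for v in per_test.values()) / len(per_test) * 100  # 85.7%

print(f"{total_score:.2f} out of {len(per_test)} ({score_pct:.2f}%)")
print(f"pass rate: {pass_rate:.1f}%")
```

This reproduces both the 5.71/7 (81.57%) score and the 85.7% pass rate, which explains why the two percentages differ: the partial-credit coding_ui test counts fully toward the pass rate but only 0.71 toward the score.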
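The debugging failure is worth a closer look. The actual task fixture is not included in this report, but the test name (debugging_easy_01_fix_greater_than) and the failure note suggest the bug was a strict comparison that should have been inclusive, and the model instead moved the threshold. A hypothetical reconstruction of the failure mode:

```python
# Hypothetical reconstruction; the real task source is not in the report.
# The test name hints at a strict '>' that should have been '>='.

def is_adult_buggy(age):
    return age > 18       # presumed original bug: excludes exactly-18-year-olds

def is_adult_expected(age):
    return age >= 18      # presumed intended fix: inclusive comparison

def is_adult_model(age):
    return age >= 21      # what the model produced: a new, different bug

# age 18 exposes both the original bug and the model's change:
print(is_adult_buggy(18), is_adult_expected(18), is_adult_model(18))
```

Under this reading, the model's edit fails for every age from 18 to 20, which is likely why the test awarded 0/1 rather than partial credit.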
Per-model aggregates from overall_ranking.json for this run id.
No matching report.json under results/ — charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
Discovery
Limited — up to 1 test per category
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
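The chart described above pairs responsiveness with throughput. The exact normalization blxbench applies is not documented in this report; a common approach, sketched below under that assumption, is min-max scaling TTFT across categories and inverting it so the most responsive category scores 1.0:

```python
# Hedged sketch: min-max normalize TTFT across categories, then invert so
# lower TTFT maps to a higher score. The sample values are illustrative
# placeholders, not figures from this run (only debugging's 1.05s TTFT
# appears in the report).
def inverted_normalized_ttft(ttft_by_category):
    lo = min(ttft_by_category.values())
    hi = max(ttft_by_category.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all TTFTs are equal
    return {cat: 1.0 - (t - lo) / span for cat, t in ttft_by_category.items()}

sample = {"debugging": 1.05, "speed": 0.70, "coding_ui": 1.40}
print(inverted_normalized_ttft(sample))
```

Plotting this inverted score against decode tok/s puts "fast to start" and "fast to generate" on the same upward-is-better footing.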
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table.
7 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)