Benchmark run
Started Apr 24, 2026, 8:58 PM · Recorded Apr 24, 2026, 11:06 PM · Ended Apr 24, 2026, 9:01 PM
Generated Apr 24, 2026, 9:01 PM · qwen/qwen3-235b-a22b-2507
- Models: tencent/hy3-preview:free, baidu/qianfan-ocr-fast:free
- Categories (7, full category coverage): coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
- --limit 1 was used, and 7 tests per model were executed, which suggests the limit was interpreted per category (1 test in each of the 7 categories) rather than per run.
- Stop on first failure: false; all tests ran even after failures.
- Rate limit: 7 RPM enforced.
- total_cost_usd: 0

| Model | Passed / Total | Pass Rate | Score / Max Score | Score % |
|---|---|---|---|---|
| baidu/qianfan-ocr-fast:free | 4/7 | 0.571 | 4.15 / 7 | 59.29% |
| tencent/hy3-preview:free | 1/7 | 0.143 | 0.73 / 7 | 10.43% |
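The derived columns in the table follow directly from the raw counts. A minimal sketch of that arithmetic, assuming pass rate is passed / total rounded to three decimals and score % is score / max score as a percentage rounded to two decimals (the helper name `summarize` is hypothetical and not part of the benchmark tool):

```python
def summarize(passed: int, total: int, score: float, max_score: float) -> dict:
    """Derive pass rate and score percentage from raw counts (assumed rounding)."""
    return {
        "pass_rate": round(passed / total, 3),
        "score_pct": round(100 * score / max_score, 2),
    }


# Raw values copied from the results table.
rows = {
    "baidu/qianfan-ocr-fast:free": summarize(4, 7, 4.15, 7),
    "tencent/hy3-preview:free": summarize(1, 7, 0.73, 7),
}

for model, stats in rows.items():
    print(model, stats)
```

Under these assumptions the computed values match the Pass Rate and Score % columns reported for both models.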
baidu/qianfan-ocr-fast:free significantly outperformed tencent/hy3-preview:free across all metrics.

- baidu/qianfan-ocr-fast:free: 3.29s; tencent/hy3-preview:free: 13.50s
- baidu/qianfan-ocr-fast:free: 1.71s; tencent/hy3-preview:free: 3.63s
- baidu/qianfan-ocr-fast:free: avg 1.78s; tencent/hy3-preview:free: avg 48.39s (only 1 sample due to missing TTFT in 6 tests)
- baidu/qianfan-ocr-fast:free: avg 1476.56 tok/s; tencent/hy3-preview:free: avg 51.25 tok/s

⚠️ The tencent/hy3-preview:free model reported ttft_seconds: null in 6/7 tests, suggesting either a logging issue or extremely delayed or missing streaming output.
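Because ttft_seconds was null in 6 of 7 tests, the Tencent TTFT average rests on a single sample. A minimal sketch of how such an average might be computed while skipping null values, so the sample count is reported alongside the mean (the function name `mean_ttft` and the position of the non-null sample in the list are assumptions for illustration; only the 48.39s value and the 6/7 null count come from the report):

```python
from typing import List, Optional, Tuple


def mean_ttft(samples: List[Optional[float]]) -> Tuple[Optional[float], int]:
    """Average the non-null TTFT samples and return (mean, sample_count)."""
    valid = [s for s in samples if s is not None]
    if not valid:
        return None, 0
    return sum(valid) / len(valid), len(valid)


# One recorded TTFT and six nulls, as described for tencent/hy3-preview:free.
# Which test produced the non-null value is illustrative here.
tencent_ttft = [48.39, None, None, None, None, None, None]

avg, n = mean_ttft(tencent_ttft)
print(f"avg TTFT {avg:.2f}s over {n} of {len(tencent_ttft)} tests")
```

Reporting the sample count next to the mean makes it clear when an average like 48.39s is not comparable to one computed over all seven tests.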
baidu/qianfan-ocr-fast:free passed all tests in:
- hallucination (1/1)
- reasoning (1/1)
- security (1/1)
- speed (1/1)

It failed in:
- coding_ui (score: 0.15): low score despite generating UI code.
- debugging: incorrect fix; returned age >= 21 instead of age >= 18.
- refactoring: logic error; did not apply .title() to names as expected.

tencent/hy3-preview:free passed only:
- coding_ui (score: 0.73): generated a detailed analog clock UI with styling.

Failed all other categories:
- debugging, reasoning, security, etc.: all score: 0

tencent/hy3-preview:free had a very slow TTFT (48.39s in the coding test), though it passed that single test.

tencent/hy3-preview:free TTFT reporting issue:
- ttft_seconds was null in 6/7 tests despite non-zero latency, suggesting output may have been generated in a non-streaming way or with a delayed start.

baidu/qianfan-ocr-fast:free debugging failure:
- The fix changed age >= 18 to age >= 21, failing debugging_easy_01_fix_greater_than.

Refactoring logic mismatch:
- baidu/qianfan-ocr-fast:free omitted .title() on names in process_users_refactor, causing failure despite otherwise correct filtering.

High output volume from Tencent:
- In coding_ui, tencent/hy3-preview:free generated 4867 total tokens vs 1662 from Baidu: high verbosity without full correctness.

baidu/qianfan-ocr-fast:free demonstrated stronger overall correctness, faster response times, and higher throughput. tencent/hy3-preview:free suffered from extremely slow TTFT, missing streaming data, and a low pass rate, despite one high-scoring UI generation. No errors were reported (error: null for all), but Tencent's missing TTFT values suggest potential integration issues.

Per-model aggregates from overall_ranking.json for this run id.
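The debugging and refactoring failures noted in this report both come down to a one-token difference from the expected behavior. The actual test sources are not included in the report, so the following is only a hypothetical sketch: the function names `is_adult` and `process_users`, the input shape, and the age-based filter are all assumptions. Only the expected comparison (age >= 18) and the expected .title() call on names are taken from the report.

```python
from typing import Dict, List


def is_adult(age: int) -> bool:
    """Hypothetical helper: the report states the expected check is age >= 18."""
    return age >= 18


def is_adult_as_returned(age: int) -> bool:
    """The variant the report attributes to the failing answer (age >= 21)."""
    return age >= 21


def process_users(users: List[Dict]) -> List[str]:
    """Hypothetical refactor target.

    The filter used here is an assumption; the report only says names were
    expected to be title-cased with .title().
    """
    return [u["name"].title() for u in users if is_adult(u["age"])]


# Ages 18 to 20 are where the two comparisons disagree.
for age in (17, 18, 20, 21):
    print(age, is_adult(age), is_adult_as_returned(age))

sample = [{"name": "ada lovelace", "age": 36}, {"name": "tim", "age": 12}]
print(process_users(sample))
```

A test that checks boundary ages such as 18 or 20, or compares output names exactly, would catch either deviation even when the rest of the logic is correct.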
No matching report.json under results/ — charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
Discovery: Limited, up to 1 test(s) per category
blxbench argv: tui
Overall score per model for this run (from overall_ranking run_models). Shown when more than one model participated.
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table.
14 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)