Benchmark run

run_f274e6

Started Apr 24, 2026, 8:58 PM · Recorded Apr 24, 2026, 11:06 PM · Ended Apr 24, 2026, 9:01 PM

Blended score 34.9 · Tests 14 · Models 2

Passed: 5

Failed: 9

Pass rate: 35.7%

Duration: 185.0s

Categories: 7

Models: 2

Est. cost (run): $0.00

Submitted by: Bitslix

Tokens (Σ results): 1.1k / 6.8k

Run summary

Generated Apr 24, 2026, 9:01 PM · qwen/qwen3-235b-a22b-2507

Scope

Models tested: tencent/hy3-preview:free, baidu/qianfan-ocr-fast:free
Categories: 7 total — coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Tests per model: 7 (full category coverage)
Run configuration: CLI option --limit 1 was used, and 7 tests per model were executed. This is consistent with the limit applying per category (1 test in each of the 7 categories), which matches the Discovery setting reported for this run.
Fail-fast mode: false — all tests ran even after failures.
Rate limiting: 7 RPM enforced.
Cost tracking: All responses had total_cost_usd: 0.
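
To make the 7 RPM figure concrete: evenly spaced requests at that rate are about 60 / 7 ≈ 8.57 seconds apart. The sketch below shows one simple way a client could enforce such a cap. It is an illustration only: the class name `MinIntervalLimiter` and its parameters are hypothetical, and the report does not say how blxbench actually implements its limit.

```python
import time


class MinIntervalLimiter:
    """Hypothetical limiter: spaces calls at least 60/rpm seconds apart."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm          # 7 RPM -> about 8.57 s
        self._clock = clock                 # injectable for testing
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self):
        """Block until the next request slot, then reserve the following one."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.interval
```

A caller would create `MinIntervalLimiter(7)` once and call `wait()` before each model request. A sliding-window limiter that allows short bursts would be an equally valid reading of "7 RPM".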

Performance

Overall Pass/Fail Summary

Model	Passed / Total	Pass Rate	Score / Max Score	Score %
`baidu/qianfan-ocr-fast:free`	`4/7`	`0.571`	`4.15 / 7`	`59.29%`
`tencent/hy3-preview:free`	`1/7`	`0.143`	`0.73 / 7`	`10.43%`

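Clear leader: baidu/qianfan-ocr-fast:free outperformed tencent/hy3-preview:free on pass rate (4/7 vs 1/7) and total score (59.29% vs 10.43%), although Tencent scored higher on the single coding_ui test (0.73 vs 0.15).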
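
The derived columns in the table can be reproduced from the raw counts and scores. The sketch below does that arithmetic. The pooled pass rate and the unweighted mean of the two score percentages happen to match the 35.7% and 34.9 shown in the run header, but treating the blended score as a plain mean is an assumption, not something the report states.

```python
# Raw per-model figures copied from the summary table.
results = {
    "baidu/qianfan-ocr-fast:free": {"passed": 4, "total": 7, "score": 4.15, "max_score": 7},
    "tencent/hy3-preview:free": {"passed": 1, "total": 7, "score": 0.73, "max_score": 7},
}


def summarize(row):
    """Return (pass_rate, score_percent) for one model row."""
    pass_rate = row["passed"] / row["total"]
    score_pct = 100 * row["score"] / row["max_score"]
    return round(pass_rate, 3), round(score_pct, 2)


per_model = {name: summarize(row) for name, row in results.items()}

# Run-level figures. The unweighted mean below is an assumed definition
# of the blended score, chosen because it reproduces the header value.
total_passed = sum(r["passed"] for r in results.values())
total_tests = sum(r["total"] for r in results.values())
pooled_pass_rate = round(100 * total_passed / total_tests, 1)
mean_score_pct = round(sum(pct for _, pct in per_model.values()) / len(per_model), 1)

for name, (rate, pct) in per_model.items():
    print(f"{name}: pass_rate={rate}, score={pct}%")
print(f"pooled pass rate={pooled_pass_rate}%, mean score={mean_score_pct}")
```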

Latency and Speed

Average latency:
- baidu/qianfan-ocr-fast:free: 3.29s
- tencent/hy3-preview:free: 13.50s
Median latency:
- baidu/qianfan-ocr-fast:free: 1.71s
- tencent/hy3-preview:free: 3.63s
Time to first token (TTFT):
- baidu/qianfan-ocr-fast:free: avg 1.78s
- tencent/hy3-preview:free: avg 48.39s (only 1 sample due to missing TTFT in 6 tests)
Output speed:
- baidu/qianfan-ocr-fast:free: avg 1476.56 tok/s
- tencent/hy3-preview:free: avg 51.25 tok/s

⚠️ The tencent/hy3-preview:free model reported ttft_seconds: null in 6/7 tests, suggesting either a logging issue or extremely delayed or missing streaming output.
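
An average built from a single usable sample deserves a caveat, so it helps to report the sample count next to the mean. The sketch below shows one way to aggregate TTFT while skipping null values. The helper name `ttft_summary` is hypothetical, and the list only mirrors the shape described above (one 48.39s value and six nulls), with the position of the real value chosen arbitrarily.

```python
from statistics import mean


def ttft_summary(samples):
    """Average TTFT over non-null samples and report how many were usable."""
    usable = [s for s in samples if s is not None]
    avg = mean(usable) if usable else None
    return {"avg": avg, "usable": len(usable), "total": len(samples)}


# Shape taken from the report: 1 recorded TTFT, 6 nulls (order is illustrative).
tencent_ttft = [48.39, None, None, None, None, None, None]
summary = ttft_summary(tencent_ttft)
print(summary)
```

Reporting `usable` alongside `avg` makes it obvious when a headline number such as 48.39s rests on one test rather than seven.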

Category-Level Observations

baidu/qianfan-ocr-fast:free passed all tests in:
- hallucination (1/1)
- reasoning (1/1)
- security (1/1)
- speed (1/1)
It failed in:
- coding_ui (score: 0.15) — low score despite generating UI code.
- debugging — incorrect fix: returned age >= 21 instead of age >= 18.
- refactoring — logic error: did not .title() names as expected.
tencent/hy3-preview:free passed only:
- coding_ui (score: 0.73) — generated a detailed analog clock UI with styling.
Failed all other categories:
- debugging, hallucination, reasoning, refactoring, security, speed (all score: 0)
- Very slow TTFT (48.39s in coding test), though it passed that single test.

Notable Failures and Anomalies

tencent/hy3-preview:free TTFT reporting issue:
- ttft_seconds was null in 6/7 tests despite non-zero latency, suggesting output may have been generated in a non-streaming way or with delayed start.
- This raises concerns about real-time usability.
baidu/qianfan-ocr-fast:free debugging failure:
- Incorrectly changed age >= 18 to age >= 21, failing debugging_easy_01_fix_greater_than.
Refactoring logic mismatch:
- baidu/qianfan-ocr-fast:free omitted .title() on names in process_users_refactor, causing failure despite otherwise correct filtering.
High output volume from Tencent:
- In coding_ui, tencent/hy3-preview:free generated 4867 total tokens vs 1662 from Baidu — high verbosity without full correctness.
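
The two Baidu failures above come down to small behavioral details that a checker would catch at the boundary. The sketch below is a hypothetical reconstruction, not the benchmark's actual test code: the function names, the data shape, and the `keep` predicate are assumptions, and the real filtering rule in process_users_refactor is not given in the report.

```python
def is_adult(age):
    # Expected comparison per the report; the failing answer used 21 instead of 18.
    return age >= 18


def title_names(users, keep):
    """Return title-cased names of the users accepted by the `keep` predicate."""
    # The .title() call is the step the failing refactor reportedly omitted.
    return [user["name"].title() for user in users if keep(user)]


# Hypothetical input: ages 18 to 20 are exactly where `>= 21` and `>= 18` disagree.
users = [
    {"name": "ada lovelace", "age": 36},
    {"name": "sam", "age": 18},
    {"name": "tim", "age": 17},
]
names = title_names(users, keep=lambda u: is_adult(u["age"]))
print(names)
```

With a threshold of 21, the 18-year-old would be dropped, and without `.title()` the names would come back in lowercase, so either change alone is enough to fail an exact-match check.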

Summary

baidu/qianfan-ocr-fast:free demonstrated stronger overall correctness, faster response times, and higher throughput.
tencent/hy3-preview:free suffered from extremely slow TTFT, missing streaming data, and a low pass rate, despite one high-scoring UI generation.
Both models are zero-cost in this run.
No errors were reported (error: null for all), but Tencent's missing TTFT values suggest potential integration issues.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model	Tests	Pass	Score	Latency	Cost
baidu/qianfan-ocr-fast:free	7	4/7	59.3%	3.29s	$0.00
tencent/hy3-preview:free	7	1/7	10.4%	13.50s	$0.00

No matching report.json under results/ — charts use ranking or summary only.

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Discovery

Limited — up to 1 test(s) per category

blxbench argv

tui

Model comparison (score %)

Overall score per model for this run (from overall_ranking run_models). Shown when more than one model participated.

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Tests per category

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table.

14 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)