Benchmark run
Started Apr 24, 2026, 9:23 PM · Recorded Apr 24, 2026, 11:06 PM · Ended Apr 24, 2026, 9:25 PM
Generated Apr 24, 2026, 9:26 PM · qwen/qwen3-235b-a22b-2507
Run summary (blxbench_run_summary)

Model: tencent/hy3-preview:free
Tests: 7 tests across 7 categories (coding_ui, debugging, hallucination, reasoning, refactoring, security, speed), one test per category via --limit 1.
Fail-fast: disabled ("fail_fast": false), so all tests ran despite failures.
Rate limit: 7 RPM.
Cost: $0.00 across all tests.
Pass rate: 0/7 tests passed (0%).
Score: 0.15 out of a maximum possible 7 (2.14%).
Latency: mean 16.10s, median 3.89s.
TTFT: mean 58.61s (based on 1 sample); only the coding_ui test reported TTFT, all others had null TTFT.
Output speed: mean 41.82 tok/s; fastest 149.00 tok/s (coding_ui); slowest 6.97 tok/s (hallucination).

Per-test results

coding_ui (analog_clock):
Passed: false. Score: 0.15/1. Latency: 87.18s, the longest of all tests. TTFT: 58.61s, extremely high and likely inflating the overall average. Output speed: 149.00 tok/s, fastest in the run. Tokens: 282 prompt + 4256 completion = 4538 total.

debugging (debugging_easy_01_fix_greater_than):
Passed: false. Score: 0/1. Latency: 3.36s. Output speed: 29.78 tok/s.

hallucination (hallucination_easy_01_not_stated_api_version):
Passed: false. Score: 0/1. Latency: 2.87s. Output speed: 6.97 tok/s, slowest in the run.

reasoning (json_output_test):
Passed: false. Score: 0/1. Latency: 3.34s. Output speed: 23.92 tok/s.

refactoring (process_users_refactor):
Passed: false. Score: 0/1. Latency: 7.93s. Output speed: 32.81 tok/s.

security (security_easy_01_sql_concat):
Passed: false. Score: 0/1. Latency: 3.89s. Output speed: 30.87 tok/s.

speed (speed_easy_01_summary_cloud):
Passed: false. Score: 0/1. Latency: 4.12s. Output speed: 19.41 tok/s.

Observations

All 7 tests failed (passed: false), indicating consistent underperformance across diverse task types.
Only 1 test (coding_ui) recorded a TTFT (58.61s), which is unusually high; all other tests have null TTFT, which may indicate instrumentation issues or model/provider-specific behavior.
Output speed ranged from 6.97 tok/s to 149.00 tok/s, suggesting highly variable generation efficiency depending on the task.
The coding_ui test generated 4256 completion tokens, far more than the others, possibly due to verbose HTML/CSS output, yet still scored only 0.15.
Total cost was zero (total_cost_usd: 0), consistent with the :free model tag.

Conclusion

The model tencent/hy3-preview:free failed all 7 benchmark tests with a near-zero overall score (2.14%). Performance was particularly poor in TTFT (58.61s on the one task that reported it) and output speed (as low as 6.97 tok/s). Despite generating large outputs (e.g., 4538 tokens in coding_ui), correctness or alignment with expectations was severely lacking. Further investigation into response quality and TTFT measurement reliability is recommended.
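The aggregate figures above follow directly from the per-test rows. A minimal sketch that reproduces them (the list layout is illustrative, not the benchmark's actual schema):

```python
from statistics import mean, median

# Per-test figures transcribed from the run summary above.
latencies_s = [87.18, 3.36, 2.87, 3.34, 7.93, 3.89, 4.12]
speeds_tok_s = [149.00, 29.78, 6.97, 23.92, 32.81, 30.87, 19.41]
scores = [0.15, 0, 0, 0, 0, 0, 0]
max_score = 7  # 7 tests, 1 point each

print(f"mean latency:   {mean(latencies_s):.2f}s")        # 16.10s
print(f"median latency: {median(latencies_s):.2f}s")      # 3.89s
print(f"mean speed:     {mean(speeds_tok_s):.2f} tok/s")  # 41.82 tok/s
print(f"overall score:  {sum(scores) / max_score:.2%}")   # 2.14%
```

Note how the single 87.18s coding_ui outlier drags the mean latency to 16.10s while the median stays at 3.89s.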
Per-model aggregates from overall_ranking.json for this run ID.
No matching report.json under results/ — charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
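The fallback described in the two notes above (prefer report.json, otherwise use ranking/summary data) can be sketched as a small loader. File names follow those mentioned in these notes; the returned structure is an assumption for illustration:

```python
import json
from pathlib import Path


def load_run_data(run_dir: str) -> dict:
    """Prefer per-test rows from report.json; fall back to the
    ranking file when the benchmark did not write a report.
    (Hypothetical helper; the real report generator's logic
    may differ.)"""
    run = Path(run_dir)
    report = run / "report.json"
    if report.exists():
        return {"source": "report", "data": json.loads(report.read_text())}
    ranking = run / "overall_ranking.json"
    if ranking.exists():
        return {"source": "ranking", "data": json.loads(ranking.read_text())}
    return {"source": "none", "data": {}}
```

With this shape, chart code can branch on `source` and skip per-test plots (latency, score-vs-latency) whenever only ranking data is available.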
Discovery
Limited — up to 1 test per category
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
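"Normalized TTFT (inverted)" presumably rescales TTFT so that higher means better, matching the direction of decode tok/s. A common choice is inverted min-max normalization; this is an assumption about the chart, not confirmed by the run output:

```python
def inverted_minmax(values: list[float]) -> list[float]:
    """Map values to [0, 1] with the smallest (best) TTFT at 1.0
    and the largest (worst) at 0.0. Degenerates to all 1.0 when
    every value is equal, or when only one sample exists, as in
    this run, where only coding_ui reported a TTFT."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [1 - (v - lo) / (hi - lo) for v in values]


print(inverted_minmax([2.0, 4.0, 6.0]))  # [1.0, 0.5, 0.0]
print(inverted_minmax([58.61]))          # [1.0]
```

With a single TTFT sample (58.61s), the normalized axis carries no information for this run; the chart is only meaningful when several categories report TTFT.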
Per-test rows from report.json → results — grouped by category (collapsed by default), then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall score (0–100) from judge_validation / validation_model for coding/UI tests (hover for the summary and subscores). No HTML or screenshots appear in this table.
7 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)