Benchmark run
Started Apr 24, 2026, 9:23 PM · Recorded Apr 24, 2026, 11:06 PM · Ended Apr 24, 2026, 9:25 PM
Generated Apr 24, 2026, 9:26 PM · qwen/qwen3-235b-a22b-2507
Run summary (blxbench_run_summary)

Model: tencent/hy3-preview:free
Tests: 7 tests across 7 categories (coding_ui, debugging, hallucination, reasoning, refactoring, security, speed), one test per category via --limit 1.
Fail-fast: disabled ("fail_fast": false), so all tests ran despite failures.
Rate limit: 7 RPM.
Cost: $0.00 across all tests.
Pass rate: 0/7 tests passed (0%).
Score: 0.15 out of a maximum possible 7 (2.14%).
Latency: mean 16.10s, median 3.89s.
TTFT: mean 58.61s (based on 1 sample); only the coding_ui test reported TTFT, all others had null TTFT.
Output speed: mean 41.82 tok/s; fastest 149.00 tok/s (coding_ui); slowest 6.97 tok/s (hallucination).

Per-test results

coding_ui (analog_clock):
Passed: false. Score: 0.15/1. Latency: 87.18s, the longest of all tests. TTFT: 58.61s, extremely high and likely inflating the overall average. Output speed: 149.00 tok/s, fastest in the run. Tokens: 282 prompt + 4256 completion = 4538 total.

debugging (debugging_easy_01_fix_greater_than):
Passed: false. Score: 0/1. Latency: 3.36s. Output speed: 29.78 tok/s.

hallucination (hallucination_easy_01_not_stated_api_version):
Passed: false. Score: 0/1. Latency: 2.87s. Output speed: 6.97 tok/s, slowest in the run.

reasoning (json_output_test):
Passed: false. Score: 0/1. Latency: 3.34s. Output speed: 23.92 tok/s.

refactoring (process_users_refactor):
Passed: false. Score: 0/1. Latency: 7.93s. Output speed: 32.81 tok/s.

security (security_easy_01_sql_concat):
Passed: false. Score: 0/1. Latency: 3.89s. Output speed: 30.87 tok/s.

speed (speed_easy_01_summary_cloud):
Passed: false. Score: 0/1. Latency: 4.12s. Output speed: 19.41 tok/s.

Observations

All 7 tests failed (passed: false), indicating consistent underperformance across diverse task types.
Only 1 test (coding_ui) recorded a TTFT (58.61s), which is unusually high; all other tests have null TTFT, which may indicate instrumentation issues or model/provider-specific behavior.
Output speed ranged from 6.97 tok/s to 149.00 tok/s, suggesting highly variable generation efficiency depending on the task.
The coding_ui test generated 4256 completion tokens, far more than the others, possibly due to verbose HTML/CSS output, yet still scored only 0.15.
Total cost was zero (total_cost_usd: 0), consistent with the :free model tag.

Conclusion

The model tencent/hy3-preview:free failed all 7 benchmark tests with a near-zero overall score (2.14%). Performance was particularly poor in TTFT (58.61s on the one task that reported it) and output speed (as low as 6.97 tok/s). Despite generating large outputs (e.g., 4538 tokens in coding_ui), correctness or alignment with expectations was severely lacking. Further investigation into response quality and TTFT measurement reliability is recommended.
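The aggregate figures above follow directly from the per-test rows. A minimal sketch that reproduces them (the list layout is illustrative, not the benchmark's actual schema):

```python
from statistics import mean, median

# Per-test figures transcribed from the run summary above.
latencies_s = [87.18, 3.36, 2.87, 3.34, 7.93, 3.89, 4.12]
speeds_tok_s = [149.00, 29.78, 6.97, 23.92, 32.81, 30.87, 19.41]
scores = [0.15, 0, 0, 0, 0, 0, 0]
max_score = 7  # 7 tests, 1 point each

print(f"mean latency:   {mean(latencies_s):.2f}s")        # 16.10s
print(f"median latency: {median(latencies_s):.2f}s")      # 3.89s
print(f"mean speed:     {mean(speeds_tok_s):.2f} tok/s")  # 41.82 tok/s
print(f"overall score:  {sum(scores) / max_score:.2%}")   # 2.14%
```

Note how the single 87.18s coding_ui outlier drags the mean latency to 16.10s while the median stays at 3.89s.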
Per-model aggregates from overall_ranking.json for this run ID.
No matching report.json under results/ — charts use ranking or summary only.
Values are read from report.json when the benchmark wrote them.
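The fallback described in the two notes above (prefer report.json, otherwise use ranking/summary data) can be sketched as a small loader. File names follow those mentioned in these notes; the returned structure is an assumption for illustration:

```python
import json
from pathlib import Path


def load_run_data(run_dir: str) -> dict:
    """Prefer per-test rows from report.json; fall back to the
    ranking file when the benchmark did not write a report.
    (Hypothetical helper; the real report generator's logic
    may differ.)"""
    run = Path(run_dir)
    report = run / "report.json"
    if report.exists():
        return {"source": "report", "data": json.loads(report.read_text())}
    ranking = run / "overall_ranking.json"
    if ranking.exists():
        return {"source": "ranking", "data": json.loads(ranking.read_text())}
    return {"source": "none", "data": {}}
```

With this shape, chart code can branch on `source` and skip per-test plots (latency, score-vs-latency) whenever only ranking data is available.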
Discovery
Limited — up to 1 test per category
blxbench argv
tui
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
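"Normalized TTFT (inverted)" presumably rescales TTFT so that higher means better, matching the direction of decode tok/s. A common choice is inverted min-max normalization; this is an assumption about the chart, not confirmed by the run output:

```python
def inverted_minmax(values: list[float]) -> list[float]:
    """Map values to [0, 1] with the smallest (best) TTFT at 1.0
    and the largest (worst) at 0.0. Degenerates to all 1.0 when
    every value is equal, or when only one sample exists, as in
    this run, where only coding_ui reported a TTFT."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [1 - (v - lo) / (hi - lo) for v in values]


print(inverted_minmax([2.0, 4.0, 6.0]))  # [1.0, 0.5, 0.0]
print(inverted_minmax([58.61]))          # [1.0]
```

With a single TTFT sample (58.61s), the normalized axis carries no information for this run; the chart is only meaningful when several categories report TTFT.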
Per-test rows from report.json → results — grouped by category (collapsed by default), then by difficulty. COMPL is taken from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall score (0–100) from judge_validation / validation_model for coding/UI tests (hover for the summary and subscores). No HTML or screenshots appear in this table.
7 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)