BLXBench - Run

Benchmark run

Started Apr 28, 2026, 8:06 PM · Recorded Apr 28, 2026, 9:01 PM · Ended Apr 28, 2026, 9:01 PM

Test suite v1 — Nutrition · 17bc604b897e…

Blended score: 29.9 · Tests: 373 · Models: 1

Passed: 112
Failed: 261
Pass rate: 30.0%
Duration: 3310.7s
Categories: 7
Models: 1
Speed avg: 14597.2 tok/s
Speed TTFT: 1435ms
Est. cost (run): $0.00
Tokens (Σ results): 22.5k / 82.0k
Submitted by: Bitslix

Run summary

Generated Apr 28, 2026, 9:01 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free
Total tests: 373
Categories: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Run mode: Full benchmark (no --limit, --category, or --level filters)
Fail-fast: Disabled (fail_fast=false)
Cost: $0.00 (all runs on free tier)
Results truncated: Yes (results_truncated=true), but full aggregates are available.

Performance Summary

Overall pass rate: 112/373 (30.0%)
Total score: 111.99 out of 375 (29.9%)
Average latency: 1.32s
Median latency: 0.897s
Average TTFT (Time to First Token): 1.44s
Average output speed: 5598.2 tok/s
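
The two headline percentages use different denominators: the pass rate divides by the 373 tests, while the score divides by 375 points. A minimal Python check of that arithmetic, using only the figures reported above (the variable names are illustrative, not BLXBench fields):

```python
# Figures copied from the summary above; names are illustrative only.
passed, total_tests = 112, 373
score, max_score = 111.99, 375

pass_rate = 100 * passed / total_tests   # share of tests that passed
score_pct = 100 * score / max_score      # share of available points earned

print(f"pass rate: {pass_rate:.1f}%")    # pass rate: 30.0%
print(f"score:     {score_pct:.1f}%")    # score:     29.9%
```

The small gap between 30.0% and 29.9% comes from the differing denominators and from fractional scores, such as the partial pass noted under Coding UI.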

Category-Level Performance

Strongest Performers

Debugging: 39/60 (65% pass rate, 65% score) — best overall category.
- Hard level: 16/20 (80% pass rate), indicating strong performance on complex bugs.
Refactoring: 26/60 (43.3% pass rate)
- Hard level: 12/20 (60% pass rate), showing better performance on harder tasks.
Security: 28/60 (46.7% pass rate)
- Hard level: 13/20 (65% pass rate)

Weakest Performers

Hallucination: 0/60 (0% pass rate) — failed every test.
- Includes easy, medium, and hard levels — consistent failure to avoid fabricated information.
Reasoning: 9/62 (14.5% pass rate, 14.1% score)
- Only 2 of 22 easy reasoning tasks passed.
- Medium and hard levels had near-zero success, except in the range reasoning_medium_09 to reasoning_medium_19, where 5 tests passed (mostly logic tasks with JSON or structured output).
Speed: 8/65 (12.3% pass rate)
- Very low pass rate across levels, worst on hard (1/23).

Coding UI

2/6 passed (33.3%), with one partial pass (0.5725 score).
High TTFT (16.02s avg on easy), especially on breakout_game (24.24s TTFT).
Output speed varied widely: 1111.16 tok/s avg, but low on complex tasks.
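
The per-category counts above can be cross-checked against the run totals. The Python sketch below recomputes each pass rate from the reported pass and test counts and confirms that they sum to 112 of 373. The dictionary layout is an assumption for illustration, not the structure of report.json.

```python
# (passed, total) per category, as reported in this summary.
categories = {
    "debugging": (39, 60),
    "security": (28, 60),
    "refactoring": (26, 60),
    "coding_ui": (2, 6),
    "reasoning": (9, 62),
    "speed": (8, 65),
    "hallucination": (0, 60),
}

# Pass rate in percent, sorted from strongest to weakest category.
rates = sorted(
    ((name, 100 * p / t) for name, (p, t) in categories.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, rate in rates:
    print(f"{name:<14}{rate:5.1f}%")

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(t for _, t in categories.values())
print(total_passed, total_tests)  # 112 373
```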

Notable Observations

High output speed in some areas:
- debugging_easy_15_fix_split_join: 76817 tok/s
- debugging_hard_20_bugfix: 19118 tok/s
- Indicates fast generation when correct.
TTFT inconsistencies:
- coding_ui had highest TTFT (16.02s on easy), suggesting slow start on UI generation.
- hallucination had very low TTFT (0.63s on hard), possibly due to short, incorrect responses.
Reasoning failures:
- Failed all basic math (addition, subtraction, etc.) and logic checks.
- Only passed a few structured logic/JSON tasks in medium level.
No hallucination mitigation: the model consistently stated unconfirmed facts, which is a critical weakness.
Latency outliers:
- breakout_game: 31.44s total latency.
- thunderstorm_over_city: 26.58s.

Conclusion

The nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free model performs moderately well in debugging, especially on hard cases, but fails severely in reasoning and hallucination avoidance. Its high output speed is promising, but slow TTFT on UI tasks and complete failure on hallucination tests are major concerns. The model may be optimized for fast, fluent output but lacks reliability in correctness and factual consistency.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model | Tests | Pass | Score | Latency | Cost
nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 373 | 112/373 | 29.9% | 1.32s | $0.00

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v1 — Nutrition

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

Not recorded (older report.json)

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.
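
How the dashboard normalizes these axes is not documented on this page. One common approach is min-max scaling, with TTFT inverted so that a lower time to first token plots further out. The Python sketch below assumes that approach. The function name and the choice of sample points (two TTFT values quoted in the run summary plus the run average) are illustrative.

```python
def normalize(values, invert=False):
    """Min-max scale a {label: value} mapping to [0, 1].

    With invert=True, the smallest value maps to 1.0 (useful for TTFT,
    where lower is better). Assumed scheme, not BLXBench's documented one.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # all values equal: avoid division by zero
        return {k: 1.0 for k in values}
    scaled = {k: (v - lo) / (hi - lo) for k, v in values.items()}
    return {k: 1.0 - s for k, s in scaled.items()} if invert else scaled

# TTFT in seconds, taken from the run summary text.
ttft = {"coding_ui (easy)": 16.02, "hallucination (hard)": 0.63, "run average": 1.44}
print(normalize(ttft, invert=True))
```

Under this scheme the slowest entry (coding_ui on easy) lands at 0 and the fastest (hallucination on hard) at 1, so a larger radar area always means better streaming behavior.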

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
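
Reading a row's cost as described above (prefer cost_usd, fall back to usage.cost) and grouping rows by category and then difficulty could look like the following Python sketch. Only results, cost_usd, and usage.cost are named on this page. The category and level keys and the sample rows are assumptions for illustration, so check them against an actual report.json.

```python
from collections import defaultdict

def row_cost(row):
    """Per-task USD: cost_usd if recorded, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")

def group_rows(results):
    """Group rows by (category, level), preserving execution order."""
    groups = defaultdict(list)
    for row in results:
        # "category" and "level" are assumed key names.
        groups[(row.get("category"), row.get("level"))].append(row)
    return dict(groups)

# Hypothetical rows, shaped like the fields described above.
sample = [
    {"category": "debugging", "level": "hard", "cost_usd": 0.0},
    {"category": "debugging", "level": "hard", "usage": {"cost": 0.0}},
    {"category": "speed", "level": "easy"},
]
grouped = group_rows(sample)
print({key: [row_cost(r) for r in rows] for key, rows in grouped.items()})
```

Returning None when neither field is present keeps "not recorded" distinct from a recorded cost of $0.00, which matters for a free-tier run like this one.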