Benchmark run
Started Apr 28, 2026, 8:06 PM · Recorded Apr 28, 2026, 9:01 PM · Ended Apr 28, 2026, 9:01 PM
Test suite v1 — Nutrition · 17bc604b897e…
Generated Apr 28, 2026, 9:01 PM · qwen/qwen3-235b-a22b-2507
Model: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free
Tasks: 373
Categories: coding_ui, debugging, hallucination, reasoning, refactoring, security, speed
Discovery: full suite (no --limit, --category, or --level filters)
fail_fast=false
Cost: $0.00 (all runs on free tier)
results_truncated=true, but full aggregates are available.
Passed: 112/373 (30.0%)
Score: 111.99 out of 375 (29.9%)
Aggregate timings and throughput (labels lost in extraction): 1.32s, 0.897s, 1.44s, 5598.2 tok/s

debugging: 39/60 (65% pass rate, 65% score), best overall category. 16/20 (80% pass rate), indicating strong performance on complex bugs.
(category label lost): 26/60 (43.3% pass rate). 12/20 (60% pass rate), showing better performance on harder tasks.
(category label lost): 28/60 (46.7% pass rate). 13/20 (65% pass rate).
hallucination: 0/60 (0% pass rate), failed every test.
reasoning: 9/62 (14.5% pass rate, 14.1% score). 2/22 on easy reasoning tasks passed. [...] reasoning_medium_09 to 19, where 5 passed (mostly logic tasks with JSON or structured output).
(category label lost): 8/65 (12.3% pass rate). [...] 1/23).
(category label lost): 2/6 passed (33.3%), with one partial pass (0.5725 score).

[...] 16.02s avg on easy), especially on breakout_game (24.24s TTFT).
1111.16 tok/s avg, but low on complex tasks.
debugging_easy_15_fix_split_join: 76817 tok/s
debugging_hard_20_bugfix: 19118 tok/s
coding_ui had the highest TTFT (16.02s on easy), suggesting a slow start on UI generation.
hallucination had very low TTFT (0.63s on hard), possibly due to short, incorrect responses.
[...] addition, subtraction, etc.) and logic checks.
breakout_game: 31.44s total latency.
thunderstorm_over_city: 26.58s.

The nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free model performs moderately well in debugging, especially on hard cases, but fails severely in reasoning and hallucination avoidance. Its high output speed is promising, but slow TTFT on UI tasks and complete failure on hallucination tests are major concerns. The model may be optimized for fast, fluent output but lacks reliability in correctness and factual consistency.
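The per-category pass counts quoted in the summary can be cross-checked against the overall 112/373 figure. The sketch below is only a sanity check on the numbers as they appear in this report; the variable and function names are illustrative, and the counts are listed in the order the summary gives them without assuming which category each belongs to.

```python
# (passed, total) pairs copied from the summary, in the order they appear.
# Category names are deliberately omitted: several labels did not survive
# in the report text, so only the counts are used here.
category_counts = [(39, 60), (26, 60), (28, 60), (0, 60), (9, 62), (8, 65), (2, 6)]


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


total_passed = sum(passed for passed, _ in category_counts)
total_tests = sum(total for _, total in category_counts)
overall_rate = pass_rate(total_passed, total_tests)
per_category_rates = [pass_rate(p, t) for p, t in category_counts]

print(f"{total_passed}/{total_tests} ({overall_rate}%)")
print(per_category_rates)
```

The seven totals add up to 373 tasks and the seven pass counts to 112, which agrees with the overall pass figure reported for the run.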
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite
v1 — Nutrition
Discovery
Full suite discovery (no --limit)
blxbench argv
tui
App version
Not recorded (older report.json)
Resumed run
No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
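The cost rule described above (use cost_usd when recorded, otherwise usage.cost) could be expressed roughly as follows. This is a minimal sketch, not the dashboard's actual code: it assumes each per-test row is a dict with an optional top-level "cost_usd" key and an optional nested "usage" dict holding "cost". That layout, the function names, and the sample rows are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Optional


def row_cost_usd(row: Dict[str, Any]) -> Optional[float]:
    """Return the per-task cost in USD, or None when no cost was recorded.

    Assumed layout (hypothetical): row["cost_usd"] or row["usage"]["cost"].
    """
    cost = row.get("cost_usd")
    if cost is None:
        usage = row.get("usage")
        if isinstance(usage, dict):
            cost = usage.get("cost")
    return None if cost is None else float(cost)


def total_cost_usd(rows: Iterable[Dict[str, Any]]) -> float:
    """Sum the recorded costs, skipping rows without any cost field."""
    return sum(c for c in map(row_cost_usd, rows) if c is not None)


# Hypothetical rows, not taken from this run's report.json.
sample_rows = [
    {"cost_usd": 0.5},
    {"usage": {"cost": 0.25}},
    {},  # nothing recorded
]
print(total_cost_usd(sample_rows))
```

Returning None rather than 0.0 for a row with no cost field keeps "not recorded" distinguishable from a genuinely free run such as this one.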
373 tasks in 7 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)