Benchmark run
Started May 10, 2026, 8:43 PM · Recorded May 10, 2026, 9:06 PM · Ended May 10, 2026, 9:06 PM
Test suite v2 — Resilience · 045d4510abd0…
Generated May 10, 2026, 9:06 PM · qwen/qwen3-235b-a22b-2507
Model: mistralai/mistral-medium-3-5
Tests: 459
Categories: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
… --limit or --category filter)
… fail_fast=false)
… results_truncated=true), meaning full per-test details may not be included

Pass rate: 221/459 (48.15%)
Score: 2205.92 out of 3137 (70.32%)
Cost: $1.39
Latency figures: 2.78s, 1.57s, 0.299s
Average output speed: 166.00 tok/s

Per-category results:
52/60 passed (86.67%), score 166/180 (92.22%)
28/30 passed (93.33%), score 135/150 (90.00%)
47/60 passed (78.33%), score 232/261 (88.89%)
8/9 passed (88.89%), score 6.92/9 (76.87%)
7/60 passed (11.67%), score 331/541 (61.18%)
16/60 passed (26.67%), score 308/512 (60.16%)
15/60 passed (25.00%), score 367/541 (67.84%)
23/60 passed (38.33%), score 406/583 (69.64%)
25/60 passed (41.67%), score 254/360 (70.56%)

… reason-constraint series.
… easy reasoning tasks (e.g., reason-ce-even-number, reason-rc-login-timeout) suggest gaps in logical consistency.
… reason-constraint-subscription-migration (9/9) and reason-constraint-batch-window (9/9) shows capability when constraints are well-structured.
… easy and medium levels.
… debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 both failed with runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function".
… debug-* tests scored partial points (e.g., 6/10, 7/10), indicating partial understanding.
… halluc-api-fetch-timeout, halluc-bug-shared-defaults).
… halluc-doc-middleware-chain, halluc-doc-node-stream-finished), suggesting tendency to invent behaviors.
… 11.67%), with only 1/20 easy tests passed.
… 26.67% pass rate), particularly in easy tests (4/20), indicating fundamental gaps in identifying security flaws.
… cost category (93.33%), the model generated a total cost of $1.39, with 180,440 completion tokens.
… reasoning (47,764 completion tokens) and refactoring (32,846) suggests verbose or inefficient outputs.
… cost (201.13 tok/s)
… reasoning (158.92 tok/s)
… coding (0.345s), likely due to complex code generation
… hallucination (0.262s)

The mistralai/mistral-medium-3-5 model performs well in speed, cost-awareness, and coding tasks, but shows significant weaknesses in debugging, reasoning under constraints, refactoring, and security. It exhibits hallucination tendencies in API and documentation claims, and struggles with logical reasoning even at easy levels. While output speed is generally strong (166 tok/s avg), high token usage drives up cost. The model may benefit from fine-tuning or prompting strategies to improve consistency and reduce hallucinations.
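The overall pass rate and score percentage can be cross-checked against the nine per-category results. The following is a minimal sketch of that arithmetic, not the benchmark's own aggregation code; the tuple layout and the names `CATEGORY_RESULTS` and `aggregate` are assumptions made for illustration. Category names are left out because the report text does not pair every row with a label.

```python
# Per-category results as (passed, total, score, max_score), in the order the
# report lists them. The tuple layout is an assumption for this sketch.
CATEGORY_RESULTS = [
    (52, 60, 166.0, 180.0),
    (28, 30, 135.0, 150.0),
    (47, 60, 232.0, 261.0),
    (8, 9, 6.92, 9.0),
    (7, 60, 331.0, 541.0),
    (16, 60, 308.0, 512.0),
    (15, 60, 367.0, 541.0),
    (23, 60, 406.0, 583.0),
    (25, 60, 254.0, 360.0),
]


def aggregate(rows):
    """Sum per-category counts and scores, then derive both percentages."""
    passed = sum(r[0] for r in rows)
    total = sum(r[1] for r in rows)
    score = sum(r[2] for r in rows)
    max_score = sum(r[3] for r in rows)
    return {
        "passed": passed,
        "total": total,
        # Strict pass rate: a test either passed or it did not.
        "pass_rate_pct": round(100 * passed / total, 2),
        "score": round(score, 2),
        "max_score": max_score,
        # Score percentage: partial credit counts toward the total.
        "score_pct": round(100 * score / max_score, 2),
    }


summary = aggregate(CATEGORY_RESULTS)
print(summary)
```

The gap between the two percentages (48.15% passed versus 70.32% of points) reflects partial credit: many failed tests still earned part of their score.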
Per-model aggregates from overall_ranking.json for this run id.
Values are read from report.json when the benchmark wrote them.
Test suite: v2 — Resilience
Discovery: Full suite discovery (no --limit)
blxbench argv: tui
App version: v1.3.2
Resumed run: No
Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.
Avg score % (bars) and strict success rate % (line) per cost cluster.
Per-test latency (seconds), successful timings only.
Normalized TTFT (inverted) vs decode tok/s per category for this run.
Average score % per metric dimension across all v2 tasks in this run.
Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).
Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)
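The grouping and cost rules described for this table can be sketched as follows. This is an illustration under stated assumptions, not the dashboard's actual code: the row keys `id`, `category`, and `difficulty`, the sample IDs, and the sample cost values are all hypothetical, and the real report.json schema may differ. Only the field names `cost_usd` and `usage.cost` come from the table description.

```python
from collections import defaultdict

# Hypothetical rows shaped like entries of report.json -> results.
# IDs, keys, and cost values are invented for illustration only.
SAMPLE_RESULTS = [
    {"id": "example-1", "category": "reasoning", "difficulty": "easy", "cost_usd": 0.002},
    {"id": "example-2", "category": "debugging", "difficulty": "medium", "usage": {"cost": 0.004}},
    {"id": "example-3", "category": "reasoning", "difficulty": "hard"},
    {"id": "example-4", "category": "reasoning", "difficulty": "easy", "usage": {"cost": 0.001}},
]


def row_cost(row):
    """Return per-task USD from cost_usd, else usage.cost, else None."""
    if row.get("cost_usd") is not None:
        return row["cost_usd"]
    return (row.get("usage") or {}).get("cost")


def group_rows(rows):
    """Group by category, then difficulty, keeping the input order of rows."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Appending in iteration order preserves benchmark execution order
        # within each (category, difficulty) bucket.
        grouped[row["category"]][row["difficulty"]].append(row)
    return grouped


grouped = group_rows(SAMPLE_RESULTS)
for category, by_difficulty in grouped.items():
    for difficulty, rows in by_difficulty.items():
        ids = [r["id"] for r in rows]
        costs = [row_cost(r) for r in rows]
        print(category, difficulty, ids, costs)
```

Rows with no recorded cost yield `None` rather than zero, so a missing value is not mistaken for a free task.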