BLXBench - Run run

Benchmark run

Started May 10, 2026, 8:43 PM · Recorded May 10, 2026, 9:06 PM · Ended May 10, 2026, 9:06 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 74.1 · Tests: 459 · Models: 1

Passed: 221
Failed: 238
Pass rate: 48.1%
Duration: 1361.0s
Categories: 9
Models: 1
Speed avg: 178.1 t/s
Speed TTFT: 287ms
Cost/strict: $0.0010
Strict success: 93.3%
Score/$: 98573.15
Failed spend: $0.0025
P50 task cost: $0.0007
P90 task cost: $0.0018
Est. cost (run): $1.39
Tokens (Σ results): 34.1k / 180.4k
Submitted by: Bitslix

Run summary

Generated May 10, 2026, 9:06 PM · qwen/qwen3-235b-a22b-2507

Scope

Model tested: mistralai/mistral-medium-3-5
Total tests: 459
Categories covered: 9 (coding, cost, debugging, hallucination, reasoning, refactoring, security, speed, ui)
Run mode: Full benchmark (no --limit or --category filter)
Fail-fast behavior: Disabled (fail_fast=false)
Results truncated: Yes (results_truncated=true), meaning full per-test details may not be included

Performance Summary

Overall pass rate: 221/459 (48.15%)
Overall score: 2205.92 out of 3137 (70.32%)
Total cost: $1.39
Average latency: 2.78s
Median latency: 1.57s
Average time to first token (TTFT): 0.299s
Average output speed: 166.00 tok/s

Category-Level Performance

High-Performing Categories

Speed: 52/60 passed (86.67%), score 166/180 (92.22%)
Cost: 28/30 passed (93.33%), score 135/150 (90.00%)
Coding: 47/60 passed (78.33%), score 232/261 (88.89%)
UI: 8/9 passed (88.89%), score 6.92/9 (76.87%)

Low-Performing Categories

Refactoring: 7/60 passed (11.67%), score 331/541 (61.18%)
Security: 16/60 passed (26.67%), score 308/512 (60.16%)
Reasoning: 15/60 passed (25.00%), score 367/541 (67.84%)
Debugging: 23/60 passed (38.33%), score 406/583 (69.64%)
Hallucination: 25/60 passed (41.67%), score 254/360 (70.56%)
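
The category figures above can be cross-checked against the overall totals in the Performance Summary. The following is a minimal sketch, assuming the overall numbers are plain sums of the per-category rows (the report does not state this, but the arithmetic agrees). The names `CATEGORIES`, `aggregate`, and `rank_by_pass_rate` are illustrative and not part of BLXBench.

```python
# Category rows copied from this report: (name, passed, tests, score, max_score).
CATEGORIES = [
    ("speed", 52, 60, 166.0, 180),
    ("cost", 28, 30, 135.0, 150),
    ("coding", 47, 60, 232.0, 261),
    ("ui", 8, 9, 6.92, 9),
    ("refactoring", 7, 60, 331.0, 541),
    ("security", 16, 60, 308.0, 512),
    ("reasoning", 15, 60, 367.0, 541),
    ("debugging", 23, 60, 406.0, 583),
    ("hallucination", 25, 60, 254.0, 360),
]


def aggregate(rows):
    """Sum the category rows into run-level totals (assumed aggregation rule)."""
    passed = sum(r[1] for r in rows)
    tests = sum(r[2] for r in rows)
    score = round(sum(r[3] for r in rows), 2)
    max_score = sum(r[4] for r in rows)
    return {
        "passed": passed,
        "tests": tests,
        "pass_rate_pct": round(100 * passed / tests, 2),
        "score": score,
        "max_score": max_score,
        "score_pct": round(100 * score / max_score, 2),
    }


def rank_by_pass_rate(rows):
    """Return (category, pass rate %) pairs, lowest pass rate first."""
    rates = [(name, round(100 * p / t, 2)) for name, p, t, _, _ in rows]
    return sorted(rates, key=lambda item: item[1])


if __name__ == "__main__":
    print(aggregate(CATEGORIES))
    for name, rate in rank_by_pass_rate(CATEGORIES):
        print(f"{name}: {rate}%")
```

Under that assumption the sums reproduce the reported 221/459 passes and 2205.92 out of 3137 points. The ranking also makes the gap between pass rate and score visible: refactoring passes 11.67% of its tests yet earns 61.18% of its points, which is consistent with many tests receiving partial credit.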

Notable Observations

Reasoning & Constraint Handling

The model struggles with reasoning under constraints, particularly in the reason-constraint series.
Several failures on easy reasoning tasks (e.g., reason-ce-even-number, reason-rc-login-timeout) suggest gaps in logical consistency.
It nonetheless scored 9/9 on both reason-constraint-subscription-migration and reason-constraint-batch-window, which shows it can succeed when the constraints are well structured.

Debugging Failures

The failure rate in debugging is high (only 23/60 passed), especially at the easy and medium levels.
Notable errors:
- debug-prototype-pollution-check-v2 and debug-prototype-pollution-merge-v2 both failed with runtime error: "Spread syntax requires ...iterable[Symbol.iterator] to be a function".
- Many debug-* tests scored partial points (e.g., 6/10, 7/10), indicating partial understanding.

Hallucination Risks

Performance on hallucination detection is mixed.
The model passed several edge-case and API behavior tests (e.g., halluc-api-fetch-timeout, halluc-bug-shared-defaults).
It failed multiple API documentation claims (e.g., halluc-doc-middleware-chain, halluc-doc-node-stream-finished), suggesting a tendency to invent behaviors.

Refactoring & Security

Refactoring has the lowest pass rate (11.67%), with only 1/20 easy tests passed.
Security performance is weak (26.67% pass rate), particularly in easy tests (4/20), indicating fundamental gaps in identifying security flaws.

Cost Efficiency

Despite a high pass rate in the cost category (93.33%), the run cost $1.39 in total and produced 180,440 completion tokens.
High token usage in reasoning (47,764 completion tokens) and refactoring (32,846) suggests verbose or inefficient outputs.
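
To put the token counts in proportion, the sketch below computes each category's share of the 180,440 completion tokens and compares it with the category's share of the 459 tests. All numbers come from this report; the function and variable names are illustrative only.

```python
# Figures copied from this report.
TOTAL_COMPLETION_TOKENS = 180_440
TOTAL_TESTS = 459
TESTS_PER_CATEGORY = 60  # both reasoning and refactoring have 60 tests
CATEGORY_COMPLETION_TOKENS = {
    "reasoning": 47_764,
    "refactoring": 32_846,
}


def share_pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


if __name__ == "__main__":
    test_share = share_pct(TESTS_PER_CATEGORY, TOTAL_TESTS)
    for name, tokens in CATEGORY_COMPLETION_TOKENS.items():
        token_share = share_pct(tokens, TOTAL_COMPLETION_TOKENS)
        print(f"{name}: {token_share}% of tokens vs {test_share}% of tests")
```

Reasoning accounts for roughly 26.47% of completion tokens while making up about 13.07% of the tests, and refactoring for roughly 18.2%. Together the two categories produce about 44.67% of all completion tokens.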

Latency & Speed

Fastest category: cost (201.13 tok/s)
Slowest category: reasoning (158.92 tok/s)
Highest TTFT: coding (0.345s), likely due to complex code generation
Lowest TTFT: hallucination (0.262s)
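
The report does not say how BLXBench derives TTFT or tok/s. The sketch below shows one common way to define them from streaming timestamps, purely as an assumption: TTFT is the delay between sending the request and receiving the first token, and decode speed is completion tokens divided by the time from the first token to the last. The `StreamTiming` type and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Hypothetical per-request timestamps, in seconds on one monotonic clock."""
    request_sent: float
    first_token: float
    last_token: float
    completion_tokens: int


def ttft_seconds(t: StreamTiming) -> float:
    """Time to first token: delay between sending the request and the first token."""
    return t.first_token - t.request_sent


def decode_tokens_per_second(t: StreamTiming) -> float:
    """Completion tokens per second over the decode window (first to last token)."""
    window = t.last_token - t.first_token
    if window <= 0:
        return 0.0  # a single-chunk response has no measurable decode window
    return t.completion_tokens / window


if __name__ == "__main__":
    # Made-up example values, not taken from the run.
    sample = StreamTiming(request_sent=0.0, first_token=0.25,
                          last_token=2.25, completion_tokens=400)
    print(ttft_seconds(sample), decode_tokens_per_second(sample))
```

Under this definition TTFT and decode speed are independent of each other, which would be consistent with coding having the highest TTFT without being the slowest category by tok/s.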

Summary

The mistralai/mistral-medium-3-5 model performs well in speed, cost-awareness, and coding tasks, but shows significant weaknesses in debugging, reasoning under constraints, refactoring, and security. It exhibits hallucination tendencies in API and documentation claims, and struggles with logical reasoning even at easy levels. While output speed is generally strong (166 tok/s avg), high token usage drives up cost. The model may benefit from fine-tuning or prompting strategies to improve consistency and reduce hallucinations.

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                         Tests  Pass     Score  Latency  Cost
mistralai/mistral-medium-3-5  459    221/459  70.3%  2.78s    $1.39

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).
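
The cost rule described above (per-task USD from cost_usd or usage.cost when recorded) can be sketched as follows. The key names `cost_usd` and `usage.cost` come from this page. The precedence between them, the treatment of rows with neither field, and the sample rows are assumptions made for illustration.

```python
def task_cost_usd(row):
    """Return the per-task cost in USD, or None when no cost was recorded.

    Assumption: cost_usd takes precedence and usage.cost is the fallback.
    """
    cost = row.get("cost_usd")
    if cost is None:
        cost = (row.get("usage") or {}).get("cost")
    return cost


def total_recorded_cost(rows):
    """Sum the cost of every row that has one, skipping rows without cost data."""
    return sum(c for c in map(task_cost_usd, rows) if c is not None)


if __name__ == "__main__":
    # Hypothetical rows, not taken from the real report.json.
    rows = [
        {"cost_usd": 0.0007},
        {"usage": {"cost": 0.0018}},
        {},  # no cost recorded
    ]
    print(round(total_recorded_cost(rows), 4))
```

Returning None instead of 0 for rows without cost data keeps "no cost recorded" distinguishable from a task that was actually free.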

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order).