BLXBench - Run run

Benchmark run

Started May 10, 2026, 8:16 PM · Recorded May 10, 2026, 8:40 PM · Ended May 10, 2026, 8:40 PM

Test suite v2 — Resilience · 045d4510abd0…

Blended score: 72.2 · Tests: 459 · Models: 1

Passed: 207

Failed: 252

Pass rate: 45.1%

Duration: 1407.1s

Categories: 9

Models: 1

Speed avg: 182.7 t/s

Speed TTFT: 287ms

Cost/strict: $0.0001

Strict success: 96.7%

Score/$: 1245054.74

Failed spend: $0.0003

P50 task cost: $0.0001

P90 task cost: $0.0001

Est. cost (run): $0.10

Tokens (Σ results): 34.1k / 159.8k

Submitted by: Bitslix

Run summary

Generated May 10, 2026, 8:40 PM · qwen/qwen3-235b-a22b-2507

Scope

This benchmark run evaluated a single model: mistralai/mistral-small-2603. A total of 458 tests were executed across multiple categories, with no test limiting or early termination (fail_fast was false). The run included all test levels and covered 9 distinct categories. The results are truncated, indicating not all test outcomes may be present in the payload.

Performance Summary

The model achieved an overall pass rate of 207/458 (45.2%) and a score percentage of 67.77% (2119.94 out of 3128 max score). The total cost of the run was $0.0996, with 34051 prompt tokens and 159768 completion tokens used.
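As a quick sanity check, the percentages in this paragraph can be recomputed from the raw counts it quotes. The sketch below only uses figures stated above; the variable names are illustrative and are not taken from report.json.

```python
# Figures copied from the summary paragraph above.
passed, total = 207, 458
score, max_score = 2119.94, 3128
prompt_tokens, completion_tokens = 34051, 159768

# Pass rate and score percentage, rounded the way the summary reports them.
pass_rate = round(100 * passed / total, 1)
score_pct = round(100 * score / max_score, 2)
total_tokens = prompt_tokens + completion_tokens

print(f"pass rate: {pass_rate}%")
print(f"score: {score_pct}%")
print(f"total tokens: {total_tokens}")
```

Both recomputed percentages agree with the values quoted in the summary.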

Average latency was 2.48s, with a median of 1.48s. The model showed strong output speed, averaging 180.22 tok/s, and a relatively low average time to first token (TTFT) of 0.37s.

Category-Level Performance

Performance varied significantly across categories:

Speed: The highest-scoring category, with a score of 95.56% and a pass rate of 90% (54/60). The model did best on hard-level tests, where it achieved a perfect score.
Cost: The highest pass rate of any category at 96.67% (29/30), with a score of 90.67%. All easy and medium sub-levels were passed perfectly.
Coding: Solid performance at 73.33% pass rate (44/60) and 89.27% score. The model achieved a perfect pass rate on easy coding tasks.
UI: Good results with 77.78% pass rate (7/9) and 77.06% score.
Hallucination: Moderate performance at 41.67% pass rate (25/60) and 65% score. Performance improved on harder tasks, suggesting better handling of complex API/edge case knowledge.
Debugging: Struggled significantly with only 19/60 (31.67%) passed and 67.92% score. Performance was lowest on easy and medium levels.
Refactoring: Very low pass rate of 16.67% (10/60) and 62.48% score. Despite moderate score percentage, few tests were fully passed.
Reasoning: Poor pass rate of 22.03% (13/59) but a higher score of 69.17%, indicating partial credit on many failed tasks.
Security: The weakest category, with only 6/60 (10%) passed and 46.09% score. All easy-level tests failed except one.
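The per-category counts listed above can be summed to check them against the overall 207/458 figure and to rank categories by pass rate. This is a minimal sketch using only the counts quoted in the list; the dictionary layout is an assumption for illustration, not the structure of report.json.

```python
# (passed, total) per category, copied from the list above.
categories = {
    "Speed": (54, 60),
    "Cost": (29, 30),
    "Coding": (44, 60),
    "UI": (7, 9),
    "Hallucination": (25, 60),
    "Debugging": (19, 60),
    "Refactoring": (10, 60),
    "Reasoning": (13, 59),
    "Security": (6, 60),
}

total_passed = sum(p for p, _ in categories.values())
total_tests = sum(n for _, n in categories.values())

# Rank categories from highest to lowest pass rate.
ranked = sorted(
    categories,
    key=lambda name: categories[name][0] / categories[name][1],
    reverse=True,
)

print(f"{total_passed}/{total_tests} passed")
for name in ranked:
    p, n = categories[name]
    print(f"{name:<14} {p:>2}/{n:<2} {100 * p / n:6.2f}%")
```

The category counts add up to the same 207 passes out of 458 tests reported in the performance summary.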

Notable Observations

The model was strong in the cost category, achieving near-perfect scores on cost tests involving refactoring, generation, and analysis.
Latency varied by category: ui had the highest average TTFT (1.07s), while speed and debugging had the lowest (about 0.29s each).
Two debugging tests (debug-prototype-pollution-check-v2, debug-prototype-pollution-merge-v2) resulted in errors with message "Spread syntax requires ...iterable[Symbol.iterator] to be a function", indicating a possible issue with handling specific JavaScript semantics.
Despite low pass rates in reasoning and security, the model earned substantial partial scores, suggesting it often produced plausible but incorrect reasoning traces or security assessments.
In hallucination tests, the model correctly avoided false claims about non-existent or incorrect API behaviors in several hard cases (e.g., halluc-api-fetch-timeout, halluc-doc-validation-pipe), showing some resistance to hallucination under complexity.
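The partial-credit observation above can be made concrete by subtracting each category's pass rate from its score percentage. The sketch below uses the figures from the category list for the five lower-performing categories; treating this gap as a rough proxy for partial credit is an assumption made here, not a metric that BLXBench reports.

```python
# (pass rate %, score %) per category, copied from the category list.
results = {
    "Hallucination": (41.67, 65.0),
    "Debugging": (31.67, 67.92),
    "Refactoring": (16.67, 62.48),
    "Reasoning": (22.03, 69.17),
    "Security": (10.0, 46.09),
}

# Gap between score % and pass rate %: a larger gap suggests more
# points earned on tests that did not fully pass (assumed proxy).
gaps = {
    name: round(score_pct - pass_pct, 2)
    for name, (pass_pct, score_pct) in results.items()
}
largest = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14} gap {gap:5.2f} points")
```

Under this proxy, reasoning and refactoring show the widest gaps, which is consistent with the note that many failed tasks in those categories still earned substantial scores.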

Models in this run

Per-model aggregates from overall_ranking.json for this run id.

Model                         Tests  Pass     Score  Latency  Cost
mistralai/mistral-small-2603  458    207/458  67.8%  2.48s    $0.10

blxbench & discovery

Values are read from report.json when the benchmark wrote them.

Test suite

v2 — Resilience

Discovery

Full suite discovery (no --limit)

blxbench argv

tui

App version

v1.3.2

Resumed run

Category performance (this run)

Score % vs mean latency where samples exist. Built from per-test rows in report.json when available.

By difficulty level

Pass vs fail

Score & success by cluster

Avg score % (bars) and strict success rate % (line) per cost cluster.

Latency distribution

Per-test latency (seconds), successful timings only.

Streaming shape (radar)

Normalized TTFT (inverted) vs decode tok/s per category for this run.

Metrics breakdown (avg %)

Average score % per metric dimension across all v2 tasks in this run.

Tests & cost by category

Tests per scope (blue bars), estimated spend per scope (green bars), and mean $ ÷ merged rows per category (cyan line).

All task results

Per-test rows from report.json → results — by category (collapsed by default), then by difficulty. COMPL from details when present. The Visual column is omitted when no test in this run has a details.visual score. Judge: verdict and overall (0–100) from judge_validation / validation_model for coding/UI (hover for summary and subscores). No HTML or screenshots in this table. Cost: per-task USD from cost_usd or usage.cost when recorded. Suite: same manifest version/hash for every row (this run).

459 tasks in 9 categories · Grouped by category, then by difficulty; row order within each table matches report.json results (benchmark execution order)