google/gemini-3.5-flash23/23 in v2 — Resilience, score 41.48%, pass rate 23.53% (108/459)2.25s, output speed 338.04 tok/s$5.43, prompt tokens 27412, completion tokens 60498351/60 passed (85%), score 225/26113/30 passed (43.33%), score 85/15014/60 passed (23.33%), score 83/1800/60), score 155/6012/60 passed (3.33%), score 135/5415/60 passed (8.33%), score 115/36013/60 passed (21.67%), score 236/512openai/gpt-5.5 leads with 77.90% score and 61.87% pass rate, while gemini-3.5-flash trails at 41.48% score and 23.53% pass rate — a gap of over 18 percentage points in pass rate.Overall score % from merged run_models rows (chronological). Only runs that include this model appear as points.
Score % vs pass rate % per category. With 0/1 scorers, both usually line up; with proportional tests, score % reflects partial credit while pass rate counts tests that clear the fixture threshold.
Total estimated spend per scope for this model (bars, left axis) and mean spend per merged result row (line, right axis: total ÷ tests).
Pass rate % per difficulty level — complements the score % view above.
Normalized 0–100 within this model: TTFT (shorter → higher spoke) and decode tok/s (higher → higher spoke). Values come from streamed BLXBench runs merged into overall_ranking.json.
Pass rate % per category for this model (distinct from score %, which reflects partial credit).