baidu/cobuddy:free
Ranked 12/19 in v2: Resilience, score 68.7%, pass rate 47.1% (216/459)
5.9s, output speed 50.6 tok/s
$0.00, prompt tokens 26,225, completion tokens 318,709

29/30 cost tests (96.7%) with zero cost.
51/60 speed tests (85%), output speed 50.6 tok/s.
4/9 passed despite small sample.
5/60 passed (8.3%), lowest category pass rate.
11/60 passed (18.3%), indicating limited multi-step logic.
14/60 passed (23.3%), suggesting vulnerability to adversarial prompts.

baidu/cobuddy:free (68.7%, 47.1%) trails openai/gpt-5.5 (77.9%, 61.9%), openai/gpt-5.3-codex (77.7%, 61.2%), and moonshotai/kimi-k2.6 (74.8%, 60.6%) in both score and pass rate.

Ranked 12/19 in Resilience, baidu/cobuddy:free fits low-budget, non-critical coding tasks where cost matters more than robustness. Avoid for security-sensitive, reasoning-heavy, or refactoring workloads due to high failure rates. Deployment risk is moderate.
Overall score % from merged run_models rows (chronological). Only runs that include this model appear as points.
Score % vs pass rate % per category. With 0/1 scorers, both usually line up; with proportional tests, score % reflects partial credit while pass rate counts tests that clear the fixture threshold.
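The gap between score % and pass rate % can be illustrated with a small sketch. The `TestResult` type, its field names, and the `>=` pass rule are assumptions made for illustration; the benchmark's actual scoring code is not shown on this page.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    # Hypothetical record: partial credit in [0, 1] and the fixture's pass threshold.
    score: float
    threshold: float


def score_pct(results):
    """Mean partial credit across tests, as a percentage."""
    return 100.0 * sum(r.score for r in results) / len(results)


def pass_rate_pct(results):
    """Share of tests whose score clears the threshold (assumed rule: score >= threshold)."""
    passed = sum(1 for r in results if r.score >= r.threshold)
    return 100.0 * passed / len(results)


# Two full passes and two half-credit results against a 0.75 threshold.
results = [
    TestResult(1.0, 0.75),
    TestResult(0.5, 0.75),
    TestResult(0.5, 0.75),
    TestResult(1.0, 0.75),
]

print(score_pct(results), pass_rate_pct(results))
```

With 0/1 scorers every score is either 0.0 or 1.0, so the two functions return the same value; they diverge only when partial credit falls below the threshold, as in the sample data here.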
Total estimated spend per scope for this model (bars, left axis) and mean spend per merged result row (line, right axis: total ÷ tests).
Pass rate % per difficulty level — complements the score % view above.
Normalized 0–100 within this model: TTFT (shorter → higher spoke) and decode tok/s (higher → higher spoke). Values come from streamed BLXBench runs merged into overall_ranking.json.
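One way such a 0–100 spoke value could be computed is min-max scaling with the direction flipped for TTFT. This is only a sketch: the page does not state the formula BLXBench uses, and the function name, the `higher_is_better` flag, and the handling of identical values are all assumptions.

```python
def normalize_spokes(values, higher_is_better=True):
    """Min-max scale values to 0-100 (assumed method, not confirmed by the source).

    For metrics where lower is better (e.g. TTFT), the scale is inverted
    so that the shortest time maps to 100.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Arbitrary choice for this sketch: identical values all get a full spoke.
        return [100.0 for _ in values]
    scaled = [100.0 * (v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [100.0 - s for s in scaled]


# Hypothetical per-scope measurements for one model.
ttft_seconds = [1.0, 2.0, 3.0]
decode_tok_s = [40.0, 50.0, 60.0]

print(normalize_spokes(ttft_seconds, higher_is_better=False))
print(normalize_spokes(decode_tok_s))
```

Because the scaling is done within a single model's values, the spokes show where that model is relatively fast or slow across its own scopes and are not comparable between models.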
Pass rate % per category for this model (distinct from score %, which reflects partial credit).