Test fixture
Bug fixes, edge conditions, and minimal patch accuracy.
The model receives the prompt (and optional system message). The run uses scorer rubric_json_metrics with the JSON configuration below. Pass/fail and partial credit are determined entirely by that scorer against the model output; no human grading.
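The scorer's internals are not shown in this fixture, so the following is only a sketch of how a `contains`-based rubric could be evaluated. It assumes that a check passes when the output contains every listed substring (case-sensitive), and that each metric's score is the fraction of its checks that pass. The function name `scoreOutput` is hypothetical and the real `rubric_json_metrics` scorer may behave differently.

```javascript
// Hypothetical sketch: not the actual rubric_json_metrics implementation.
// Assumption: a check passes if the output includes every string in "contains";
// a metric's score is (passed checks) / (total checks).
function scoreOutput(output, config) {
  const result = {};
  for (const [name, metric] of Object.entries(config.metrics)) {
    const passed = metric.checks.filter((check) =>
      check.contains.every((needle) => output.includes(needle))
    ).length;
    result[name] = passed / metric.checks.length;
  }
  return result;
}

// Trimmed-down configuration with two of the "diagnose" checks.
const sampleConfig = {
  metrics: {
    diagnose: {
      checks: [{ contains: ["in place"] }, { contains: ["top scores"] }],
    },
  },
};

console.log(scoreOutput("The helper sorts the array in place.", sampleConfig));
// { diagnose: 0.5 }
```

Under these assumptions, partial credit falls out naturally: an answer that mentions only some of the expected phrases earns a fractional score for that metric.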
Return JSON only with keys diagnosis, fix, tests. A helper sorts an input array in place before returning top scores, causing callers to observe reordered data. Identify the mutation and fix it.
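The fixture does not include the code under test, so the following is an illustrative sketch of the kind of bug the prompt describes and one possible fix. The function names `topScoresMutating` and `topScores` are hypothetical. The fix shown copies the array with `slice()` before sorting; `toSorted()` would be an alternative on runtimes that support it.

```javascript
// Hypothetical helpers; the fixture itself does not show the code under test.

// Buggy variant: Array.prototype.sort sorts in place, so the caller's
// input array is reordered as a side effect.
function topScoresMutating(scores, n) {
  return scores.sort((a, b) => b - a).slice(0, n);
}

// Fixed variant: sort a shallow copy so the input array is left untouched.
function topScores(scores, n) {
  return scores.slice().sort((a, b) => b - a).slice(0, n);
}

const input = [40, 95, 10, 70];
const top = topScores(input, 2);
console.log(top);   // [ 95, 70 ]
console.log(input); // [ 40, 95, 10, 70 ] (original order preserved)
```

A regression test for this kind of fix would typically assert both the returned top scores and that the input array keeps its original order.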
{
  "metrics": {
    "repro": {
      "checks": [
        { "contains": ["input array"] },
        { "contains": ["reordered"] },
        { "contains": ["sort"] },
        { "contains": ["caller"] }
      ]
    },
    "hidden": {
      "checks": [
        { "contains": ["copy"] },
        { "contains": ["toSorted"] },
        { "contains": ["slice"] }
      ]
    },
    "diagnose": {
      "checks": [
        { "contains": ["in place"] },
        { "contains": ["mutation"] },
        { "contains": ["top scores"] }
      ]
    }
  }
}
temperature: 0
max_tokens: 420
timeout (s): 120
type: scored
file: debug-array-sort-mutation-v2.json