Test fixture
Arithmetic, symbolic steps, and structured problem solving.
The model receives the prompt (and an optional system message). The run uses the exact_match scorer with the JSON configuration below. Pass/fail and partial credit are determined entirely by that scorer against the model output; there is no human grading.
Set A = {2, 4, 6, 8, 10} and Set B = {1, 2, 3, 5, 8, 13}. How many elements are in A intersect B? Return only the number.
{
  "expected": "2"
}
temperature: 0
max_tokens: 20
timeout (s): 120
type: scored
file: reasoning_medium_03.json
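
As a sanity check on the expected value, the following minimal Python sketch computes the intersection from the prompt and compares the result with the configured answer. The function name exact_match, its whitespace-trimming behavior, and the config dictionary are assumptions made for illustration; the fixture does not specify how the real scorer normalizes output, so the actual comparison may be stricter.

```python
# Illustrative sketch only: exact_match here is a hypothetical stand-in,
# not the harness's actual scorer implementation.

A = {2, 4, 6, 8, 10}
B = {1, 2, 3, 5, 8, 13}


def exact_match(output: str, expected: str) -> bool:
    """Return True when the output equals the expected string.

    Assumption: leading and trailing whitespace is ignored.
    The real scorer may compare the raw strings instead.
    """
    return output.strip() == expected.strip()


# Elements present in both sets.
common = A & B

# The prompt asks for the count only, as a bare number.
answer = str(len(common))

# Mirrors the JSON configuration shown in the fixture.
config = {"expected": "2"}

passed = exact_match(answer, config["expected"])
print(sorted(common), answer, passed)  # prints: [2, 8] 2 True
```

Under this assumed comparison, an output such as "2 elements" would fail, which is why the prompt ends with "Return only the number."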