LLMs as Teaching Assistants for Mathematics Exam Grading: Reliability and Practical Usability

Open-ended mathematics exams are valuable because they assess reasoning, proof construction, algorithmic thinking, and the communication of intermediate steps. They are also difficult to grade at scale: instructors must apply partial-credit rubrics consistently while giving feedback that helps students repair misconceptions. This paper evaluates six contemporary large language model (LLM) configurations (Gemini 3.1 Pro Extended, Gemini 3.5 Flash, ChatGPT 5.5 Pro Extended, ChatGPT 5.5 Thinking, Claude Pro Opus 4.7, and Claude Sonnet 4.6) as grading assistants for an undergraduate discrete mathematics examination. The study compares two grading policies. The BASELINE policy uses a strict rubric-following prompt that emphasizes explicit evidence and complete justification. The LIBERAL policy was added after preliminary grading showed that the BASELINE condition sometimes deducted points harshly and failed to recognize valid partial reasoning. Agreement with human grading is measured at both the question level and the exam-total level using mean absolute error (MAE), root mean squared error (RMSE), normalized root mean squared error (NRMSE), Pearson correlation, and exact agreement. The results show that liberal partial-credit prompting reduces average question-level error for every evaluated model family. ChatGPT 5.5 Thinking (LIBERAL) has the lowest average question-level MAE (1.87) and RMSE (2.53), while Gemini 3.1 Pro Extended (LIBERAL) has the lowest total-score MAE (8.00) and RMSE (10.66). However, the strongest total-score Pearson correlation, 0.58, occurs under Gemini 3.1 Pro Extended (BASELINE), which shows that point calibration (assigning scores close to the human scores) and rank preservation (ordering students as the human grader does) remain distinct goals. We also report practical usability observations.
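As an illustration of the agreement metrics named above, the following minimal Python sketch computes them for paired human and model scores. It is not the study's evaluation code. The function name `agreement_metrics`, the `max_score` parameter, and the choice to normalize RMSE by the maximum attainable score are assumptions made for this example; the abstract does not state which normalization is used.

```python
import math


def agreement_metrics(human, model, max_score):
    """Compare model-assigned scores with human scores for the same items.

    Hypothetical helper for illustration only. `max_score` is the maximum
    attainable score; dividing RMSE by it is an assumed normalization.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    n = len(human)
    errors = [m - h for h, m in zip(human, model)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    nrmse = rmse / max_score  # assumption: normalize by maximum score
    exact = sum(1 for e in errors if e == 0) / n

    # Pearson correlation, computed from deviations around each mean.
    mean_h = sum(human) / n
    mean_m = sum(model) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, model))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in model)
    if var_h > 0 and var_m > 0:
        pearson = cov / math.sqrt(var_h * var_m)
    else:
        pearson = float("nan")  # undefined when either series is constant

    return {
        "mae": mae,
        "rmse": rmse,
        "nrmse": nrmse,
        "pearson": pearson,
        "exact_agreement": exact,
    }


# A grader that is uniformly two points too generous keeps the ranking
# intact (Pearson 1.0) while every individual score is off (MAE 2.0).
shifted = agreement_metrics([1, 2, 3], [3, 4, 5], max_score=10)
print(shifted["mae"], shifted["pearson"])
```

The final call uses made-up scores to show why calibration and rank preservation can diverge: a constant offset leaves the correlation perfect while the point error stays nonzero.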