Coding
Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering
The paper evaluates LLM graders built for K-12 educational assessments using context and prompt engineering, with models including Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. Using interrater agreement metrics such as Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE), the study shows that LLMs, particularly the larger models, can achieve significant agreement with human raters in mathematics and science, but face challenges in English Language Arts. The research highlights the potential for LLMs to improve grading efficiency and feedback quality, and advocates a hybrid approach that combines AI with teacher expertise.
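For readers unfamiliar with Quadratic Weighted Kappa, the following is a minimal sketch of the standard textbook formula, not the paper's own evaluation code. The function name `quadratic_weighted_kappa`, its parameters, and the sample scores are all illustrative assumptions; the scores are made up and are not results from the study.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores.

    Hypothetical helper for illustration only; it follows the standard
    definition kappa = 1 - sum(w * observed) / sum(w * expected), with
    weights w[i][j] = (i - j)^2 / (k - 1)^2 over k score categories.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score categories")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for the human rater
    model_hist = Counter(model)            # marginal counts for the LLM grader

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[(i, j)] / n
            denominator += weight * human_hist[i] * model_hist[j] / (n * n)

    if denominator == 0:
        # Both raters gave one identical score to every response.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - numerator / denominator


# Made-up scores on a 0-2 rubric (assumption, not data from the paper).
human_scores = [0, 1, 2, 2]
model_scores = [0, 1, 2, 1]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 0, 2)
print(round(kappa, 3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. The quadratic weights penalize a two-point miss four times as heavily as a one-point miss, which is why the metric suits ordinal rubric scores.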
llm, education, grading