LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
This study evaluates the ability of 42 proprietary and open-weight large language models (LLMs) to measure item discrimination in reading comprehension assessments, a critical psychometric property that captures how well an item distinguishes students of varying proficiency levels. Two methods are compared: direct discrimination prediction and response-based Classical Test Theory (CTT) calibration. The best-performing model achieves a Spearman correlation of only 0.152 in direct prediction and 0.241 in response-based calibration. These results indicate that while LLMs contain some relevant signal for item discrimination, they currently fall short of reliably assessing how well items differentiate between students, which poses a significant challenge for practitioners using LLMs in educational assessment.
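To make the two quantities in this summary concrete, the following is a minimal Python sketch of a CTT discrimination index computed from a binary response matrix, and of a Spearman correlation between two sets of item scores. It assumes the corrected item-total (item-rest) point-biserial correlation as the discrimination index, which is a common CTT choice but is not confirmed here as the study's exact statistic. All function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def item_rest_discrimination(responses):
    """Item-rest correlation per item.

    responses: list of rows, one per student, each a list of 0/1 item scores.
    Assumption: this index stands in for the CTT discrimination statistic.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    discrimination = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Remove the item's own score from the total to avoid inflating the correlation.
        rest = [t - s for t, s in zip(totals, item)]
        discrimination.append(pearson(item, rest))
    return discrimination


if __name__ == "__main__":
    # Hypothetical toy data: 4 students x 3 items (not from the study).
    toy_responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    empirical = item_rest_discrimination(toy_responses)
    # Hypothetical model-predicted discrimination values for the same items.
    predicted = [0.2, 0.5, 0.4]
    print(empirical, spearman(predicted, empirical))
```

In a response-based setup, the response matrix would come from model-simulated student answers instead of real ones. The resulting discrimination values would then be rank-correlated with the values obtained from real student data.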