Research
Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations
This study investigates how large language models (LLMs) perform in title-abstract screening for systematic reviews, analyzing over 1,000 papers across six software engineering reviews. Kappa values for agreement between LLMs and human experts ranged from 0.52 to 0.77, indicating moderate to substantial agreement. The remaining disagreements point to significant reliability issues, driven by factors such as boundary ambiguity and incorrect topic inference. The authors propose actionable recommendations for deploying LLMs in this setting, aimed at practitioners who want more accurate and reliable LLM-assisted screening.
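To make the agreement figures concrete, here is a minimal sketch of how a kappa statistic can be computed for two screeners. It assumes Cohen's kappa for two raters with include/exclude labels; the summary does not say which kappa variant the study used, and the function name `cohen_kappa` and the toy label lists are hypothetical, not data from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: two raters, nominal labels (e.g. "include"/"exclude").
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten papers (illustrative only).
human = ["include", "include", "exclude", "exclude", "exclude",
         "include", "exclude", "exclude", "exclude", "exclude"]
llm = ["include", "exclude", "exclude", "exclude", "exclude",
       "include", "exclude", "include", "exclude", "exclude"]

print(f"kappa = {cohen_kappa(human, llm):.2f}")
```

In this toy case the two screeners agree on 8 of 10 papers, yet kappa is noticeably lower than 0.8 because much of that agreement would be expected by chance when most papers are excluded. Under the commonly used Landis and Koch bands, values from 0.41 to 0.60 are read as moderate and 0.61 to 0.80 as substantial agreement, which matches the wording of the summary above.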
llm, screening, systematic reviews