Research
RECOM: A Validity-Discrimination Tradeoff in Automatic Metrics for Open-Ended Reddit Question Answering
The paper introduces RECOM (Reddit Evaluation for Correspondence of Models), a dataset for evaluating LLM-generated responses to open-ended questions from r/AskReddit. It comprises 15,000 questions paired with their authentic community replies. The paper identifies a tradeoff in automatic evaluation metrics between validity and discriminative power: existing metrics such as cosine similarity and BERTScore struggle to rank LLMs effectively while remaining valid. Because the validity-discrimination tradeoff is inherent to the metrics themselves rather than to the models evaluated, the work emphasizes that practitioners should report evaluation metrics on both axes to assess model performance properly.
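As a rough illustration of the kind of similarity-based metric the summary mentions, the sketch below scores a model response against a set of community replies by averaging cosine similarity over embedding vectors. It is not the paper's implementation: the function names `cosine_similarity` and `mean_reference_score` are hypothetical, the embeddings are assumed to be precomputed by some unspecified encoder, and averaging over replies is an assumed aggregation choice.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Define similarity with a zero vector as 0.0 to avoid division by zero.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_reference_score(response_vec, reference_vecs):
    """Average similarity of one response embedding to several reply embeddings.

    Assumption: averaging over replies is an illustrative choice only;
    the source does not state how scores are aggregated.
    """
    if not reference_vecs:
        raise ValueError("at least one reference vector is required")
    scores = [cosine_similarity(response_vec, ref) for ref in reference_vecs]
    return sum(scores) / len(scores)
```

A score of this kind measures only how close a response is to the community replies. Whether it is valid, and whether it separates models from one another, are the two axes the paper asks practitioners to report.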
llm, evaluation, automatic metrics