Research
Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages
The paper introduces ITEM, a large-scale benchmark for assessing how reliably 29 automatic metrics evaluate machine translation (MT) and text summarization (TS) across six major Indian languages. Its key finding is that LLM-based evaluators align most closely with human judgments. The results also highlight the importance of accounting for outliers when measuring metric reliability, and show that metrics vary in how well they capture content fidelity versus fluency. For practitioners, the work offers guidance on improving evaluation practices for low-resource languages and on applying metrics across diverse linguistic contexts.
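To make the notions of "alignment with human judgments" and "outlier impact" concrete, here is a minimal sketch of one common way to measure them: correlating metric scores with human ratings, with and without a single extreme point. It is not the paper's protocol, which this summary does not describe. The scores are invented for illustration, and the helper names `pearson` and `kendall_tau` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: sensitive to the magnitude of each point."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented data (assumption): human ratings and one metric's scores
# for five system outputs that are all of similar quality.
human = [3.0, 3.2, 3.1, 3.4, 3.3]
metric = [0.52, 0.50, 0.55, 0.51, 0.53]

# The same data plus one very poor output that both humans and the
# metric score far below the rest (the outlier).
human_out = human + [1.0]
metric_out = metric + [0.10]

r_without = pearson(human, metric)
r_with = pearson(human_out, metric_out)
tau_without = kendall_tau(human, metric)
tau_with = kendall_tau(human_out, metric_out)

print(f"Pearson  without outlier: {r_without:+.2f}, with: {r_with:+.2f}")
print(f"Kendall  without outlier: {tau_without:+.2f}, with: {tau_with:+.2f}")
```

In this toy data the single outlier flips Pearson from negative to strongly positive, while the rank-based Kendall tau moves much less. That is the kind of effect that makes outlier handling matter when ranking metrics by agreement with humans.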
machine translation, evaluation, Indian languages