Research
Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval
The paper introduces ParaEval, an evaluation framework designed to reduce the sensitivity of multiple-choice question answering (MCQA) benchmarks to how answer options are phrased. ParaEval supplies multiple paraphrases of each answer option and scores a model on its best-performing phrasing. Across models ranging from 1B to 120B parameters, this reduces the false performance gap that phrasing introduces, giving a more reliable assessment of what a model actually knows. For practitioners, the benefit is evaluation metrics that reflect genuine knowledge rather than superficial familiarity with specific phrasing.
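The best-phrasing idea can be illustrated with a minimal sketch. This is one plausible reading of the summary, not the paper's implementation: it assumes each answer option is scored by the highest score among its paraphrases, and that the model's prediction is the option with the highest such score. The names `predict_option`, `accuracy`, and `score_fn`, the item layout, and the toy scores are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: (question, answer text) -> score, higher is better.
# In practice this might be a model log-likelihood; here it is an assumption.
ScoreFn = Callable[[str, str], float]


def predict_option(question: str,
                   paraphrases: Dict[str, List[str]],
                   score_fn: ScoreFn) -> str:
    """Return the option label whose best-scoring paraphrase scores highest."""
    best_per_option = {
        label: max(score_fn(question, text) for text in texts)
        for label, texts in paraphrases.items()
    }
    return max(best_per_option, key=best_per_option.get)


def accuracy(items: List[dict], score_fn: ScoreFn) -> float:
    """Fraction of items whose predicted option matches the gold label."""
    correct = sum(
        predict_option(item["question"], item["paraphrases"], score_fn) == item["answer"]
        for item in items
    )
    return correct / len(items)


# Toy data: invented scores standing in for a model (not real results).
TOY_SCORES = {"carbon dioxide": -5.0, "CO2": -1.0, "oxygen": -2.0, "O2": -3.0}


def toy_score(question: str, text: str) -> float:
    return TOY_SCORES[text]


items = [{
    "question": "Which gas do plants absorb for photosynthesis?",
    "paraphrases": {"A": ["carbon dioxide", "CO2"], "B": ["oxygen", "O2"]},
    "answer": "A",
}]

# Baseline: keep only the first phrasing of each option.
single_phrasing = [
    {**item, "paraphrases": {k: v[:1] for k, v in item["paraphrases"].items()}}
    for item in items
]

single_acc = accuracy(single_phrasing, toy_score)
para_acc = accuracy(items, toy_score)
```

In this toy case the scorer prefers the wrong option under the original phrasing but the correct one once the paraphrase "CO2" is allowed, which is the kind of phrasing-driven gap the summary describes.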
evaluation, llm, mcqa