Safety
Durable Evaluation Framework: Adversarial Arbitration for Sycophancy Reduction in Large Language Models
The Durable Evaluation Framework (DEF) is a multi-agent architecture for reducing sycophancy in RLHF-trained large language models through adversarial arbitration between two models tuned to opposing DEFs. Five variants are evaluated on 200 questions from SycophancyEval: the DeWin variant achieves 48.5% accuracy, and BurGal reaches 53% as a validity check. For practitioners, the approach offers a way to improve the reliability and accuracy of model outputs by addressing sycophancy biases inherent in current RLHF methods.
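To make the reported figures concrete, the short sketch below converts each accuracy percentage into a count of correct answers. It assumes, as the summary suggests but does not state outright, that accuracy is the plain fraction of the 200 SycophancyEval questions answered correctly. The helper name `correct_count` is hypothetical and used only for illustration.

```python
def correct_count(accuracy_percent: float, n_questions: int) -> int:
    """Number of correct answers implied by an accuracy percentage.

    Assumption: accuracy = correct / total, with no weighting or abstentions.
    """
    return round(accuracy_percent / 100 * n_questions)


N_QUESTIONS = 200  # evaluation set size stated in the summary
reported = {"DeWin": 48.5, "BurGal": 53.0}  # accuracies stated in the summary

counts = {name: correct_count(acc, N_QUESTIONS) for name, acc in reported.items()}

for name, count in counts.items():
    print(f"{name}: {count}/{N_QUESTIONS} correct")
```

Under that assumption, the two figures correspond to 97 and 106 correct answers, a gap of nine questions on an evaluation set of this size.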
evaluation, sycophancy, llm, arbitration