Training
Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback
The paper introduces On-Policy Co-Distillation (OPCoD), a training approach in which two multi-domain large language models (LLMs) improve each other through peer feedback. The method aims for mutual Pareto improvement: each model strengthens its capabilities across domains without sacrificing its original strengths. The study reports that OPCoD outperforms existing baselines on Science Q&A tasks, using cognizance-based gating and feedback anchoring.
co-distillation, llm, peer feedback