Agents · arXiv cs.AI · 10 d ago

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

The article introduces S-SPPO (Semantic-Calibrated Self-Play Preference Optimization), a framework that stabilizes Self-Play Preference Optimization (SPPO) for Large Language Models (LLMs). It targets the policy degeneration caused by overly confident preference assignments. S-SPPO uses a dual-space semantic calibration approach with two components, Supervision Calibration and Representation Calibration, which together maintain geometric diversity and prevent manifold collapse. On AlpacaEval 2.0 with the Llama-3-8B model, S-SPPO achieves a win rate of 52.19% and a length-controlled win rate of 47.46%, without requiring additional human-annotated preferences during training. This makes it relevant for practitioners who want to align LLMs with human preferences more reliably.
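To make "overly confident preference assignments" concrete, here is a minimal sketch of the per-response squared-error objective used by the original SPPO method, in which the log-ratio between the policy and its previous iterate is regressed toward eta * (P - 1/2). The function names, the value of eta, and the softened probability 0.8 are illustrative assumptions. The softened label is only a stand-in to show the effect of less extreme preference estimates; it is not the calibration procedure of S-SPPO, which this summary does not specify.

```python
def sppo_target(pref_prob: float, eta: float) -> float:
    """Regression target for log(pi_new / pi_old) given an estimated
    probability that the response beats the current policy."""
    return eta * (pref_prob - 0.5)


def sppo_loss(logp_new: float, logp_old: float, pref_prob: float, eta: float) -> float:
    """Squared error between the policy log-ratio and the SPPO target."""
    log_ratio = logp_new - logp_old
    return (log_ratio - sppo_target(pref_prob, eta)) ** 2


if __name__ == "__main__":
    eta = 2.0  # hypothetical scale, chosen only to keep the numbers readable

    # Hard labels (P = 1 for the winner, P = 0 for the loser) push the
    # log-ratio toward the extreme targets +eta/2 and -eta/2.
    print("hard winner target:", sppo_target(1.0, eta))
    print("hard loser target:", sppo_target(0.0, eta))

    # A less confident estimate (assumed value) gives a milder target.
    print("soft winner target:", sppo_target(0.8, eta))
```

With hard 0/1 labels every winning response receives the same maximal target regardless of how close the comparison was, which is one way to read the overconfidence the summary describes.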

