Training · arXiv cs.AI · 15 d ago

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

The paper introduces AAPA (Adversarially Anchored Preference Alignment), a framework that augments post-training alignment of large language models with a sentence-level adversarial anchoring signal. The signal compares policy rollouts against pre-collected expert responses, so it needs neither online teacher inference nor co-training, and it can be layered onto existing training methods such as SFT, GRPO, and CHORD. In experiments on instruction-following benchmarks, AAPA achieves a 5.77% improvement over a GRPO baseline on the Qwen3-0.6B model, suggesting that the anchor can provide stable semantic grounding for preference optimization in LLMs.

alignment · post-training · llm
