Research · arXiv cs.CL · 2 d ago

Mechanistic Analysis of Alignment Algorithms in Language Models

The paper presents a mechanistic analysis of six alignment algorithms (PPO, DPO, SimPO, ORPO, GRPO, and KTO), examining how each changes a language model's internal computations across three open-weight model families. It finds that preference signals localize in early-mid or mid-late layers, but that the algorithms induce distinct geometric transformations: KTO and GRPO enhance linear separability, whereas DPO and ORPO degrade it. The analysis underscores the need for mechanism-aware optimization objectives and for standardized auditing of alignment processes for safety and interpretability, and it shows practitioners how differently alignment strategies can affect model behavior.
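The summary does not say how linear separability was measured. One common way to quantify it is a linear probe: fit a linear classifier on a layer's activations and report held-out accuracy. The sketch below illustrates that idea only and is not the paper's procedure. The activations are synthetic Gaussian data, and the names `probe_accuracy` and `synthetic_layer` are hypothetical helpers introduced for this example.

```python
import numpy as np


def probe_accuracy(acts, labels, train_frac=0.5, seed=0):
    """Fit a least-squares linear probe and return held-out accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    train, test = idx[:n_train], idx[n_train:]
    # Append a bias column and regress onto +1 / -1 targets.
    X = np.hstack([acts, np.ones((len(acts), 1))])
    y = 2.0 * labels - 1.0
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    preds = (X[test] @ w) > 0
    return float(np.mean(preds == labels[test].astype(bool)))


def synthetic_layer(n, dim, gap, seed):
    """Hypothetical stand-in for one layer's activations (an assumption):
    two Gaussian classes offset by +/- gap along the first coordinate."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += gap * (2 * labels - 1)
    return acts, labels


if __name__ == "__main__":
    # A larger gap models a layer where the two classes are easier to separate.
    for name, gap in [("weakly separable", 0.2), ("strongly separable", 3.0)]:
        acts, labels = synthetic_layer(400, 16, gap, seed=1)
        print(name, probe_accuracy(acts, labels))
```

With real models, `acts` would be hidden states extracted at a given layer for preferred and dispreferred responses. Comparing probe accuracy before and after alignment, layer by layer, is one way to see whether a method increases or decreases separability.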

alignment · mechanistic · llm · relevance 0.00 · engagement 0.00
