Research · arXiv cs.AI · 14 d ago

Emergent Alignment

The paper proposes aligning Large Language Models (LLMs) with human ethics by adding a "conscience step" in which the model reviews its own outputs. It uses Direct Preference Optimization (DPO) to add an alignment component to the training loss, so the model can self-correct without an external judge. The authors report that the method is effective across training, fine-tuning, adversarial prompting, and zero-shot settings, and that it addresses unethical behaviors previously observed in LLMs.
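To make the "alignment component in the training loss" concrete, here is a minimal sketch of the standard per-example DPO objective added to a base loss. The summary does not say how the paper combines the two terms, so the weighted sum in `combined_loss`, the `weight` parameter, and the idea of treating the model's original answer as "rejected" and its conscience-step revision as "chosen" are all assumptions for illustration, not the authors' confirmed method.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = -beta * margin
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def combined_loss(lm_loss, alignment_loss, weight=1.0):
    """Hypothetical combination: base loss plus a weighted alignment term."""
    return lm_loss + weight * alignment_loss


# Assumed setup: "chosen" is the revised output, "rejected" the original one.
alignment = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
total = combined_loss(lm_loss=2.0, alignment_loss=alignment, weight=0.5)
```

When the policy prefers the chosen response more strongly than the reference does, the margin is positive and the alignment term falls below log 2, its value when policy and reference agree.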

llm · alignment · ethics
