Research
Emergent Alignment
The paper introduces a new approach for aligning Large Language Models (LLMs) with human ethics through a "conscience step" in which the model reviews its own outputs. It uses Direct Preference Optimization (DPO) to add an alignment component to the training loss, allowing the model to self-correct without external judges. The authors report that the method is effective across training, fine-tuning, adversarial prompting, and zero-shot learning, and that it addresses unethical behaviors previously observed in LLMs.
llm alignment ethics
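The summary above does not give the paper's exact loss, so the following is only a minimal sketch of the general idea of adding a DPO term to a training loss. It uses the standard DPO objective for a single preference pair; the function names, the `alignment_weight` parameter, and the choice to simply sum the two terms are assumptions made for illustration and are not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. beta scales the implicit reward.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits)
    return softplus(-logits)


def combined_loss(lm_loss: float, alignment_loss: float,
                  alignment_weight: float = 1.0) -> float:
    """Hypothetical combination: base loss plus a weighted alignment term."""
    return lm_loss + alignment_weight * alignment_loss


# Policy identical to the reference: the DPO term equals log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy shifted toward the chosen response: the DPO term decreases.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)

total = combined_loss(lm_loss=2.0, alignment_loss=improved, alignment_weight=0.5)
```

In a self-review setup like the one described, the chosen and rejected responses would presumably come from the model's own assessment of its outputs rather than from an external judge; how the paper constructs those pairs is not stated in this summary.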