Safety · arXiv cs.AI · 21 h ago

Emergent alignment and the projectability of ethical personas

The paper introduces the concept of "emergent alignment" in LLMs and explores how finetuning models on narrow safety tasks can induce aligned behavior across broader categories. Using the "Constitutional AI" approach, four ethical frameworks were applied during finetuning. The resulting models exhibit distinct "ethical personas" corresponding to their training constitution, and performance varies significantly across models. The research argues that alignment strategies should be assessed not only on general safety outcomes but also on how consistently they project ethical behavior, a point that matters to practitioners building robust and reliable AI systems.

alignment · emergent alignment · llm
