Research · arXiv cs.AI · 9 d ago

Constitutional Value Potentials: reading and steering internal priority margins in language models

The paper introduces Constitutional Value Potentials (CVP), a method for assessing how well a language model adheres to specified values by analyzing its internal activations rather than only its outputs. It shows that a scalar potential can be learned for each value, giving a structured way to measure priority margins, with a monitor that reaches an AUROC of 0.95 when predicting value-conflict violations across three Qwen2.5 model scales. The approach gives practitioners a framework for steering model behavior toward specified values, which may improve the interpretability and reliability of AI systems.
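To make the idea of a per-value scalar potential and a margin-based monitor concrete, here is a minimal sketch on synthetic data. It is not the paper's implementation: this summary does not give the parameterization, so the linear readout in `potential`, the definition of the margin as a difference of two potentials, and all names (`potential`, `priority_margin`, `w_high`, `w_low`) are assumptions made for illustration. The activations are random vectors, not Qwen2.5 hidden states, so the printed AUROC says nothing about the reported 0.95.

```python
import random

def potential(weights, activation):
    # Assumed form: a scalar potential as a linear readout of an activation vector.
    return sum(w * a for w, a in zip(weights, activation))

def priority_margin(w_high, w_low, activation):
    # Assumed definition: potential of the higher-priority value minus the lower one.
    return potential(w_high, activation) - potential(w_low, activation)

def auroc(scores, labels):
    # Probability that a random positive outscores a random negative (ties count 0.5).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
dim = 8
# Hypothetical readout weights; in practice these would be learned from labeled data.
w_high = [rng.gauss(0, 1) for _ in range(dim)]
w_low = [rng.gauss(0, 1) for _ in range(dim)]

scores, labels = [], []
for _ in range(200):
    activation = [rng.gauss(0, 1) for _ in range(dim)]  # synthetic stand-in for hidden states
    margin = priority_margin(w_high, w_low, activation)
    # Synthetic ground truth: a violation is more likely when the noisy margin is negative.
    violated = 1 if margin + rng.gauss(0, 0.5) < 0 else 0
    scores.append(-margin)  # monitor score: a lower margin means a higher violation risk
    labels.append(violated)

monitor_auroc = auroc(scores, labels)
print(f"synthetic monitor AUROC: {monitor_auroc:.3f}")
```

The monitor here simply flags inputs whose margin is small or negative; AUROC then measures how well that ranking separates violating from non-violating cases, independent of any particular threshold.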

language models · value alignment · safety
