Safety · arXiv cs.AI · 21 h ago

Hidden Consensus: Preference-Validity Compression in Human Feedback

The paper introduces the concept of Preference-Validity Compression in Reinforcement Learning from Human Feedback (RLHF), arguing that reducing diverse human judgments to a single scalar reward can misrepresent alignment in culturally plural societies. An analysis of 321 preference events from Malaysian participants reveals that 79% of prompts have multiple majority-supported responses, indicating that traditional single-winner aggregation overlooks significant interpretive diversity. The authors advocate for alignment methods that maintain Validity-Preserving Consistency to better capture plural-valid responses, emphasizing the need for more nuanced feedback aggregation in AI systems.
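The contrast between single-winner aggregation and a plural-valid view can be illustrated with a minimal sketch. It is not the authors' method: the summary does not give the paper's data format or its exact definition of "majority-supported", so the sketch assumes each preference event is a binary approval of one response, and that a response is majority-supported when more than half of its raters approve it. The toy `events` list and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data, NOT from the paper: (prompt_id, response_id, approved).
events = [
    ("p1", "a", True), ("p1", "a", True), ("p1", "a", False),
    ("p1", "b", True), ("p1", "b", True), ("p1", "b", True),
    ("p2", "a", True), ("p2", "a", False), ("p2", "a", False),
    ("p2", "b", True), ("p2", "b", True), ("p2", "b", False),
]

def support_rates(events):
    """Return {prompt: {response: fraction of raters who approved}}."""
    counts = defaultdict(lambda: [0, 0])  # (prompt, response) -> [approvals, total]
    for prompt, response, approved in events:
        counts[(prompt, response)][0] += int(approved)
        counts[(prompt, response)][1] += 1
    rates = defaultdict(dict)
    for (prompt, response), (yes, total) in counts.items():
        rates[prompt][response] = yes / total
    return dict(rates)

def single_winner(rates):
    """Collapse each prompt to the one response with the highest support."""
    return {p: max(r, key=r.get) for p, r in rates.items()}

def majority_supported(rates, threshold=0.5):
    """Keep every response whose support exceeds the (assumed) threshold."""
    return {
        p: sorted(resp for resp, rate in r.items() if rate > threshold)
        for p, r in rates.items()
    }

def plural_share(valid_sets):
    """Fraction of prompts with more than one majority-supported response."""
    return sum(len(v) > 1 for v in valid_sets.values()) / len(valid_sets)

rates = support_rates(events)
print(single_winner(rates))
print(majority_supported(rates))
print(plural_share(majority_supported(rates)))
```

In this toy data, prompt `p1` has two responses that both clear the threshold, yet single-winner aggregation reports only one of them. That is the kind of discarded information the summary attributes to collapsing judgments into a single scalar reward.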

rlhf · human-feedback · alignment
