Safety
Hidden Consensus: Preference-Validity Compression in Human Feedback
The paper introduces the concept of Preference-Validity Compression in Reinforcement Learning from Human Feedback (RLHF), arguing that reducing diverse human judgments to a single scalar reward can misrepresent alignment in culturally plural societies. An analysis of 321 preference events from Malaysian participants reveals that 79% of prompts have multiple majority-supported responses, indicating that traditional single-winner aggregation overlooks significant interpretive diversity. The authors advocate for alignment methods that maintain Validity-Preserving Consistency to better capture plural-valid responses, emphasizing the need for more nuanced feedback aggregation in AI systems.
rlhf, human-feedback, alignment
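
To make the contrast between single-winner aggregation and plural-valid responses concrete, here is a minimal sketch on toy data. It is not the authors' method: the summary does not say how "majority-supported" is operationalized, so the sketch assumes each rater marks a set of acceptable responses and that a response is majority-supported when more than half of the raters accept it. The function names, the threshold, and the data are all hypothetical.

```python
from collections import Counter


def majority_supported(votes, threshold=0.5):
    """Return every response accepted by more than `threshold` of raters.

    `votes` is a list with one set per rater, holding the responses that
    rater judged acceptable (an assumed data layout, not the paper's).
    """
    n = len(votes)
    counts = Counter(r for rater in votes for r in rater)
    return sorted(r for r, c in counts.items() if c / n > threshold)


def single_winner(votes):
    """Collapse the same votes to one response, as a scalar reward would."""
    counts = Counter(r for rater in votes for r in rater)
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(counts, key=lambda r: (-counts[r], r))


# Hypothetical toy data: two prompts, four raters each.
prompts = {
    "p1": [{"A", "B"}, {"A", "B"}, {"A"}, {"B", "C"}],
    "p2": [{"A"}, {"A"}, {"B"}, {"A"}],
}

plural = {p: majority_supported(v) for p, v in prompts.items()}
winners = {p: single_winner(v) for p, v in prompts.items()}

# Share of prompts with more than one majority-supported response.
share_plural = sum(len(r) > 1 for r in plural.values()) / len(prompts)

print(plural)
print(winners)
print(share_plural)
```

On this toy data, prompt p1 has two responses that clear the majority threshold, yet single-winner aggregation reports only one of them. That discarded second response is the kind of information the summary describes as lost under compression.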