Safety
Steerable Cultural Preference Optimization of Reward Models
The paper introduces Steerable Cultural Preference Optimization (SCPO), a training algorithm for reward models that reflect diverse cultural preferences without excessive bias. On the PRISM and GlobalOpinionQA datasets, SCPO improves minority reward models by up to 7 points over baseline models and shows a 280% increase in data efficiency over traditional full-data finetuning. For practitioners, this supports more equitable and culturally sensitive AI systems by improving how well large language models align with varied user preferences.
llm, alignment, cultural-preferences