Safety · arXiv cs.CL · 2 d ago

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

This paper uses one-shot Group Relative Policy Optimization (GRPO) to show that a single biased example can induce systematic bias in a large language model (LLM) and compromise its alignment. Susceptibility varies from model to model and depends on each model's initial output tendencies, which exposes a significant vulnerability in post-training alignment. Because even minimal exposure to biased data can shift model behavior broadly, the findings underscore the need for robust bias mitigation when developing and deploying LLMs.
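For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at its core: several completions are sampled for one prompt, each is scored, and every reward is normalized against the group's mean and standard deviation. This is a minimal illustration, not the paper's implementation. The function name, the reward values, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper for illustration only; `eps` guards against
    division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed scenario: four completions sampled for ONE training prompt,
# where the reward favors a biased answer and only the first one gives it.
biased_rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(biased_rewards)

# If no sampled completion ever produces the rewarded behavior, every
# advantage is zero and this group contributes no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(advantages)
print(uniform)
```

In this toy setup, the one rewarded completion gets a positive advantage and the rest get negative ones, so a policy update would push probability toward the rewarded answer. A group with identical rewards yields no signal at all, which is one plausible reading of why a model's initial output tendencies could matter for susceptibility. That link is an interpretation, not a claim taken from the paper.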

bias · alignment · llm
