Training
Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning
The paper presents ReRULE, an off-policy replay mechanism for reinforcement unlearning in large language models (LLMs). ReRULE keeps a replay buffer of low-reward, hard-case rollout groups and revisits them during training, so that updates concentrate on the cases that still need further learning. On the MUSE-Books benchmark, this raises the Retain Quality score from 46.3 to 56.2 while adding only 5-11% to training time. The approach is most useful in scenarios with pronounced disparities between easy and hard cases.
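To make the replay idea concrete, here is a minimal sketch of a buffer that retains low-reward rollout groups and mixes them back into later batches. It is an illustration only, not the paper's implementation: the names `RolloutGroup`, `HardCaseReplayBuffer`, `maybe_add`, and `build_batch`, the mean-reward threshold rule, and the first-in-first-out eviction are all assumptions, since the summary does not specify how ReRULE selects, evicts, or reweights replayed groups.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutGroup:
    """Hypothetical container: one prompt with its sampled responses and rewards."""
    prompt: str
    responses: List[str]
    rewards: List[float]  # assumed non-empty

    @property
    def mean_reward(self) -> float:
        return sum(self.rewards) / len(self.rewards)


class HardCaseReplayBuffer:
    """Stores rollout groups whose mean reward falls below a threshold (assumed rule)."""

    def __init__(self, capacity: int, reward_threshold: float, seed: int = 0):
        # A bounded deque drops the oldest group once capacity is reached.
        self.groups = deque(maxlen=capacity)
        self.reward_threshold = reward_threshold
        self.rng = random.Random(seed)

    def maybe_add(self, group: RolloutGroup) -> bool:
        """Keep the group only if it counts as a hard case; report whether it was kept."""
        if group.mean_reward < self.reward_threshold:
            self.groups.append(group)
            return True
        return False

    def sample(self, k: int) -> List[RolloutGroup]:
        """Draw up to k stored groups without replacement."""
        k = min(k, len(self.groups))
        return self.rng.sample(list(self.groups), k)


def build_batch(fresh: List[RolloutGroup],
                buffer: HardCaseReplayBuffer,
                n_replay: int) -> List[RolloutGroup]:
    """Combine fresh on-policy groups with replayed hard cases."""
    # Sample before inserting the fresh groups so none appears twice in one batch.
    replayed = buffer.sample(n_replay)
    for group in fresh:
        buffer.maybe_add(group)
    return fresh + replayed
```

Replayed groups were generated by an earlier policy, so a real training loop would typically also need some off-policy correction when computing the update; that step is deliberately left out of this sketch.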
llm reinforcement unlearning