Training
AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining
The paper introduces Actor-Critic Online Data Mixing (AC-ODM), an approach to LLM pretraining that uses reinforcement learning to adjust the data mixture dynamically during training. AC-ODM supports two modes: a proxy mode, in which a policy learned on a smaller model is transferred to a larger one, and a non-proxy mode, in which the policy is trained from scratch alongside the target model. On the Pythia-1B model, AC-ODM reaches the best validation perplexity attained by existing methods in 66% fewer training steps, with a 27.5% improvement in MMLU accuracy and a 2.23x higher pass@1 on HumanEval, at minimal computational overhead.
LLM, pretraining, reinforcement-learning