Training · arXiv cs.AI · 10 d ago

AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

The paper introduces Actor-Critic Online Data Mixing (AC-ODM), an approach to LLM pretraining that uses reinforcement learning to adjust the data mixture dynamically during training. AC-ODM supports two modes: a proxy mode, which transfers policies learned on smaller models to larger ones, and a non-proxy mode for training from scratch. The reported results show AC-ODM reaching the optimal validation perplexity on the Pythia-1B model in 66% fewer training steps than existing methods, along with a 27.5% improvement in MMLU accuracy and a 2.23x higher pass@1 on HumanEval, with minimal computational overhead.
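To make the idea of RL-driven data mixing concrete, here is a minimal toy sketch. It is not the paper's algorithm: it assumes a softmax policy over data domains (the actor), a scalar running baseline (the critic), and a simulated reward in place of a real training signal. The function name `run_mixing` and the `domain_gains` values are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_mixing(domain_gains, steps=2000, actor_lr=0.1, critic_lr=0.05, seed=0):
    """Toy actor-critic loop that learns sampling weights over data domains.

    domain_gains: hypothetical mean reward per domain, a stand-in for the
    signal a real trainer would measure after training on a batch.
    Returns the final sampling probabilities over domains.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(domain_gains)  # actor parameters (start uniform)
    baseline = 0.0                      # critic: running estimate of expected reward

    for _ in range(steps):
        probs = softmax(logits)
        # Actor samples which domain the next batch is drawn from.
        k = rng.choices(range(len(probs)), weights=probs)[0]
        # Simulated noisy reward; a real system would train on the batch here.
        reward = domain_gains[k] + rng.gauss(0.0, 0.1)
        advantage = reward - baseline
        # Policy-gradient step: d log pi(k) / d logit_j = 1[j == k] - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == k else 0.0) - probs[j]
            logits[j] += actor_lr * advantage * grad
        # Critic moves toward the observed reward.
        baseline += critic_lr * advantage

    return softmax(logits)


if __name__ == "__main__":
    weights = run_mixing([0.2, 1.0, 0.5])
    print([round(w, 3) for w in weights])
```

In this toy setting the policy shifts sampling weight toward the domain with the highest simulated reward. A real data-mixing system would replace the simulated reward with a measured training signal and, in a proxy setup, reuse the learned weights when training a larger model.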

LLM · pretraining · reinforcement-learning
