Training
Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning
The article presents a data-centric approach to improving long-context reasoning in large language models trained with reinforcement learning (RL). It introduces a data recipe of eight curated datasets totaling roughly 14,000 examples. On Qwen3-4B, Qwen3-8B, and Qwen3-30B-A3B, the recipe raises the average score across seven long-context benchmarks by +7.2, +3.2, and +6.4 points, respectively. It also improves agentic tasks, adding +4.8 points on GAIA and +7.0 points on BrowseComp. The authors release the datasets to support further research, emphasizing the importance of diverse training data over traditional reward engineering.
reinforcement-learning, long-context, data