Training
Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training
The paper introduces query recycling, a technique for GRPO-style reinforcement learning algorithms. Zero-variance queries, those for which every rollout in a sampled group receives the same reward and the group-relative advantage is therefore zero, can be resampled later in training, so they regain utility as the policy evolves. A 1.7B-parameter model trained on synthetic data reaches an average Pass@1 of 66.0 across seven multi-hop QA benchmarks, outperforming larger models of up to 7B parameters. For practitioners, the approach improves training efficiency by adapting the training distribution dynamically, which may allow more effective learning from a limited pool of queries.
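To make the notion of a zero-variance query concrete, the following is a minimal Python sketch, not the paper's implementation. It standardizes rewards within each query's rollout group, as GRPO-style methods do, and sets aside queries whose rollouts all received the same reward. The names `group_advantages`, `is_zero_variance`, `split_batch`, and `recycle_pool` are hypothetical, and the policy for when recycled queries are resampled is left open because the summary does not specify it.

```python
from collections import deque
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every rollout in the group received the same reward."""
    return len(set(rewards)) <= 1


def split_batch(batch_rewards):
    """Split a batch into queries with a learning signal and queries to recycle.

    batch_rewards maps a query id to the list of rewards of its rollouts.
    """
    trainable, recycled = {}, []
    for query_id, rewards in batch_rewards.items():
        if is_zero_variance(rewards):
            # All advantages would be zero, so this group yields no gradient now.
            recycled.append(query_id)
        else:
            trainable[query_id] = group_advantages(rewards)
    return trainable, recycled


# Hypothetical pool of queries to revisit once the policy has changed.
recycle_pool = deque()

# Toy rewards (assumed binary correctness) for three queries, four rollouts each.
batch = {
    "q1": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: useful signal
    "q2": [0.0, 0.0, 0.0, 0.0],  # always wrong: zero variance
    "q3": [1.0, 1.0, 1.0, 1.0],  # always right: zero variance
}
trainable, recycled = split_batch(batch)
recycle_pool.extend(recycled)
```

In this toy batch only `q1` produces non-zero advantages. `q2` and `q3` go to the pool, from which a training loop could draw them again at a later step, when the updated policy may no longer answer them uniformly.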
reinforcement learning, zero-variance, training