Training
Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL
The paper introduces RefGRPO, an extension of standard reinforcement learning (RL) that addresses the reflection gap in LLM agents by adding a free calibration bonus, derived from comparing the agent's own performance assessments with actual outcomes. On text-to-SQL tasks across five benchmarks, it reduces the underconfidence rate from 44.4% to 7.7% and improves task accuracy from 75.1% to 76.5%. The method lets agents self-verify their outputs based on environmental feedback, which supports more effective self-improvement and selective prediction at test time, both of which matter to practitioners developing LLM-based agents.
safe reinforcement learning, policy optimization
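The calibration bonus described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the binary self-assessment, the bonus weight of 0.2, and the use of GRPO-style group normalization are all assumptions made for illustration. The idea shown is only that the agent earns a small extra reward when its own verdict on a rollout agrees with the outcome the environment already reports.

```python
from statistics import mean, pstdev


def calibration_bonus(self_assessment: bool, task_success: bool, weight: float = 0.2) -> float:
    """Hypothetical bonus: paid when the agent's own verdict matches the real outcome."""
    return weight if self_assessment == task_success else 0.0


def shaped_rewards(rollouts, weight: float = 0.2):
    """Task reward (1.0 on success, 0.0 on failure) plus the assumed calibration bonus.

    Each rollout is a (task_success, self_assessment) pair of booleans.
    """
    return [
        float(success) + calibration_bonus(assessment, success, weight)
        for success, assessment in rollouts
    ]


def group_relative_advantages(rewards, eps: float = 1e-6):
    """GRPO-style normalization: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts for the same prompt (illustrative values only).
group = [
    (True, True),    # correct and confident: task reward + bonus
    (True, False),   # correct but underconfident: task reward only
    (False, False),  # wrong and aware of it: bonus only
    (False, True),   # wrong and overconfident: nothing
]
rewards = shaped_rewards(group)
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the correct-but-underconfident rollout receives a lower advantage than the correct-and-confident one, so the policy is nudged toward self-assessments that track the environment's feedback without any additional labels.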