Research
Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates
The paper proves that Monte Carlo optimistic policy iteration (MC-O-PI) can converge to the optimal policy without requiring episodes to be initialized uniformly over the entire state-action space. It is enough for updates to be uniform over the actions within each state. Because episodes may then start in different states at arbitrary frequencies, the algorithm becomes more practical to implement in large or unknown state spaces. The analysis also provides a new framework for studying the convergence of optimistic policy iteration algorithms in reinforcement learning.
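To make the relaxed condition concrete, here is a minimal sketch of one way such a scheme could look. It is not the paper's algorithm or experimental setup. The toy chain MDP, the reward values, the 1/n step size, the choice to update only the starting pair, and all names (`EXIT_REWARD`, `step`, `greedy`, `mc_optimistic_policy_iteration`, `start_weights`) are assumptions made for illustration. The start state is drawn with arbitrary, non-uniform frequencies, while the start action is drawn uniformly over the actions of that state.

```python
import random

# Hypothetical toy episodic MDP (not from the paper): states 0..2 in a chain.
# Action 0 = "exit": collect EXIT_REWARD[s] and terminate.
# Action 1 = "advance": move to s + 1 with reward 0; advancing from the
# last state terminates with reward FINAL_REWARD.
EXIT_REWARD = [0.2, 0.5, 0.3]
FINAL_REWARD = 1.0
N_STATES, N_ACTIONS = 3, 2


def step(state, action):
    """Return (reward, next_state); next_state is None when the episode ends."""
    if action == 0:
        return EXIT_REWARD[state], None
    if state == N_STATES - 1:
        return FINAL_REWARD, None
    return 0.0, state + 1


def greedy(q, state):
    """Greedy action under the current estimates (ties go to the lower index)."""
    return max(range(N_ACTIONS), key=lambda a: q[state][a])


def mc_optimistic_policy_iteration(start_weights, episodes=5000, seed=0):
    """Sketch: Monte Carlo returns under the current greedy policy, with the
    policy changing after every episode (no full policy evaluation)."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    visits = [[0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        # Start state: arbitrary (here non-uniform) frequencies.
        s0 = rng.choices(range(N_STATES), weights=start_weights)[0]
        # Start action: uniform over the actions available in that state.
        a0 = rng.randrange(N_ACTIONS)
        # Roll out: take a0 first, then follow the current greedy policy.
        state, action, ret = s0, a0, 0.0
        while state is not None:
            reward, state = step(state, action)
            ret += reward  # undiscounted return
            if state is not None:
                action = greedy(q, state)
        # Assumption: update only the starting pair, with a 1/n step size.
        visits[s0][a0] += 1
        q[s0][a0] += (ret - q[s0][a0]) / visits[s0][a0]
    policy = [greedy(q, s) for s in range(N_STATES)]
    return q, policy
```

Calling `mc_optimistic_policy_iteration([0.7, 0.2, 0.1])` starts most episodes in state 0 and few in state 2. In this toy chain advancing is optimal in every state, so the returned greedy policy can be compared against that to check the behaviour despite the skewed start-state frequencies.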
policy-iteration, reinforcement-learning, convergence