Research
QPILOTS: Efficient Test-Time Q-Steering for Flow Policies
QPILOTS is a novel method for optimizing flow-matching and diffusion policies in reinforcement learning by steering the denoising process at inference time, without modifying the original policy. It comes in two variants: QPILOTS-U, which uses a fast single-point approximation, and QPILOTS-M, which uses a learned auxiliary network for differentiable posterior sampling. QPILOTS achieves an average success rate of 90% across 50 tasks on a standard offline-to-online RL benchmark and demonstrates superior performance on manipulation tasks with a large, frozen, pretrained Vision-Language-Action (VLA) model, addressing numerical-stability challenges in policy extraction.
reinforcement-learning, policies, Q-learning
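To make the idea of test-time steering concrete, here is a minimal toy sketch in Python. It is not the QPILOTS implementation: the summary does not specify the update rule, so everything here is an assumption for illustration. The frozen policy is a hypothetical 2-D Gaussian flow with a closed-form velocity field, the critic is a hypothetical quadratic Q-function, and the "single-point approximation" is interpreted as evaluating the Q-gradient at a one-step estimate of the final action, which may differ from what QPILOTS-U actually does. The names `velocity`, `q_grad`, `sample`, and `guidance` are all made up for this example.

```python
import numpy as np

# Hypothetical frozen policy: final actions a1 ~ N(MU, SIGMA^2 I), noise a0 ~ N(0, I).
MU, SIGMA = np.array([1.0, -1.0]), 0.5
# Hypothetical critic: Q(a) = -||a - A_STAR||^2, so it prefers actions near A_STAR.
A_STAR = np.array([2.0, 0.0])

def velocity(a, t):
    """Marginal velocity E[a1 - a0 | a_t = a] for the path a_t = (1 - t) a0 + t a1."""
    var_t = (1 - t) ** 2 + (t * SIGMA) ** 2
    gain = (t * SIGMA ** 2 - (1 - t)) / var_t
    return MU + gain * (a - t * MU)

def q_value(a):
    return -np.sum((a - A_STAR) ** 2, axis=-1)

def q_grad(a):
    return -2.0 * (a - A_STAR)

def sample(a0, steps=20, guidance=0.0):
    """Euler-integrate the frozen flow; guidance > 0 adds a Q-gradient steering term."""
    a, dt = a0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(a, t)
        # One-step estimate of the final action from the current noisy action
        # (an assumed reading of "single-point approximation").
        a1_hat = a + (1 - t) * v
        a = a + dt * (v + guidance * q_grad(a1_hat))
    return a

rng = np.random.default_rng(0)
noise = rng.standard_normal((256, 2))
base = sample(noise)                    # unsteered samples from the frozen policy
steered = sample(noise, guidance=0.5)   # same noise, steered toward higher Q
print(q_value(base).mean(), q_value(steered).mean())
```

The policy parameters are never updated in this sketch; only the sampling trajectory changes, which is the property the summary attributes to inference-time steering. The guidance weight trades off staying close to the frozen policy against following the critic.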