Research · arXiv cs.AI · 9 d ago

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS is a method for improving flow-matching and diffusion policies in reinforcement learning by steering the denoising process at inference time, without modifying the original policy. It comes in two variants: QPILOTS-U, which uses a fast single-point approximation, and QPILOTS-M, which employs a learned auxiliary network for differentiable posterior sampling. QPILOTS achieves an average success rate of 90% across 50 tasks on a standard offline-to-online RL benchmark and demonstrates superior performance on manipulation tasks with a large, frozen, pretrained Vision-Language-Action (VLA) model, while addressing numerical-stability challenges in policy extraction.
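To make the idea of test-time steering concrete, here is a minimal toy sketch in NumPy. It is not the paper's algorithm: the frozen "policy" is an analytic Gaussian flow, the critic is a hand-written quadratic Q-function, and every name (`base_velocity`, `q_value`, `q_grad`, `steer_sample`, `MU`, `SIGMA`, `scale`) is a hypothetical stand-in. It only illustrates the general pattern the summary describes: the velocity field stays untouched, and each denoising step is nudged by the Q-gradient evaluated at a single-point estimate of the final action.

```python
import numpy as np

# Assumed toy setup: the frozen policy samples actions from N(MU, SIGMA^2 I)
# through a linear-interpolation flow a_t = (1 - t) * a_0 + t * a_1.
MU = np.zeros(2)
SIGMA = 0.5


def base_velocity(a, t):
    """Closed-form marginal velocity of the toy Gaussian flow (frozen, never modified)."""
    s2 = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    coeff = (t * SIGMA**2 - (1.0 - t)) / s2
    return MU + coeff * (a - t * MU)


def q_value(a, goal):
    """Hypothetical critic: higher when the action is closer to `goal`."""
    return -np.sum((a - goal) ** 2, axis=-1)


def q_grad(a, goal):
    """Analytic gradient of q_value with respect to the action."""
    return -2.0 * (a - goal)


def steer_sample(noise, goal, steps=50, scale=1.0):
    """Euler-integrate the flow, adding a Q-gradient term at each step.

    With scale=0.0 this reduces to ordinary sampling from the frozen policy.
    """
    a = np.array(noise, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = base_velocity(a, t)
        # Single-point estimate of the final action: one extrapolation
        # along the current velocity instead of a full rollout.
        a1_hat = a + (1.0 - t) * v
        # Steering: follow the frozen velocity plus the scaled Q-gradient.
        a = a + dt * (v + scale * q_grad(a1_hat, goal))
    return a
```

Calling `steer_sample(noise, goal, scale=0.0)` and `steer_sample(noise, goal, scale=1.0)` on the same noise batch lets one compare the average `q_value` with and without steering. In a real system the gradient would come from a learned critic via automatic differentiation, and the guidance scale would need tuning.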

Tags: reinforcement-learning, policies, Q-learning
