The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
The article introduces Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for training-free decoding in large language models (LLMs). APPS targets the sequence-level power distribution while maintaining only a bounded population of partial solutions. It combines proposal-corrected power reweighting with future-value-guided selection to arbitrate among competing hypotheses, so the particle count can be chosen freely and memory usage stays predictable. The approach improves the accuracy-runtime trade-off of training-free decoding, which suggests that inference-time power approximation can recover gains usually attributed to post-training, a result of practical interest to anyone tuning LLM deployment.
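To make "proposal-corrected power reweighting" concrete, here is a minimal sketch of the generic importance-sampling idea behind it. It is not the authors' APPS implementation: the function names, the toy probabilities, and the exponent `alpha` are all assumptions chosen for illustration, and the blockwise structure and future-value-guided selection are omitted. If candidates are drawn from a proposal q and the target is proportional to p raised to the power alpha, each candidate receives weight p^alpha / q, and resampling keeps the population at a fixed size.

```python
import math
import random


def power_log_weights(logp, logq, alpha):
    """Unnormalized log importance weights for a target p^alpha under proposal q."""
    return [alpha * lp - lq for lp, lq in zip(logp, logq)]


def normalize(log_w):
    """Turn log weights into probabilities with a numerically stable softmax."""
    m = max(log_w)
    exps = [math.exp(lw - m) for lw in log_w]
    total = sum(exps)
    return [e / total for e in exps]


def resample(particles, weights, n, rng):
    """Multinomial resampling back to a fixed population of n particles."""
    return rng.choices(particles, weights=weights, k=n)


# Hypothetical toy setup: three partial solutions with made-up model probabilities.
particles = ["a", "b", "c"]
logp = [math.log(0.5), math.log(0.3), math.log(0.2)]
logq = list(logp)   # assumption: the proposal is the base model itself
alpha = 2.0         # assumption: illustrative power exponent

weights = normalize(power_log_weights(logp, logq, alpha))
population = resample(particles, weights, n=4, rng=random.Random(0))
```

When the proposal equals the base model, the weight reduces to p^(alpha - 1), so higher-probability candidates are favored more strongly as alpha grows. The population size `n` is independent of how many candidates were scored, which is the sense in which memory stays bounded in this sketch.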