Inference
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
K-Forcing is a push-forward language modeling paradigm for joint next-k-token decoding, aimed at making autoregressive (AR) models more efficient in high-load batch serving. It distills an existing AR model into a mapping that generates several future tokens in a single forward pass, while keeping a standard causal Transformer architecture. Configured to generate 4 tokens at a time, it achieves a speedup of approximately 2.4-3.5x. The approach targets the growing inference cost of large language models and is relevant to practitioners optimizing deployment efficiency.
language modeling, decoding, efficiency
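To make the forward-pass arithmetic behind joint next-k-token decoding concrete, here is a minimal sketch of a generic decoding loop that emits k tokens per step. It is not the K-Forcing method itself: `decode`, `forward_passes`, and `toy_step` are hypothetical names, and `toy_step` is a stand-in that simply counts upward rather than calling a real model. The sketch only shows how the number of forward passes shrinks as k grows.

```python
import math
from typing import Callable, List, Tuple


def forward_passes(num_tokens: int, k: int) -> int:
    """Forward passes needed to emit num_tokens when each pass yields k tokens."""
    if num_tokens < 0 or k < 1:
        raise ValueError("num_tokens must be >= 0 and k must be >= 1")
    return math.ceil(num_tokens / k)


def decode(
    step_fn: Callable[[List[int], int], List[int]],
    prompt: List[int],
    num_tokens: int,
    k: int,
) -> Tuple[List[int], int]:
    """Generate num_tokens tokens, asking step_fn for k tokens per pass.

    step_fn(context, k) is a hypothetical interface that returns the next
    k tokens given the context so far. Returns (generated tokens, passes).
    """
    context = list(prompt)
    generated: List[int] = []
    passes = 0
    while len(generated) < num_tokens:
        remaining = num_tokens - len(generated)
        new_tokens = step_fn(context, k)[:remaining]  # trim the final chunk
        context.extend(new_tokens)
        generated.extend(new_tokens)
        passes += 1
    return generated, passes


def toy_step(context: List[int], k: int) -> List[int]:
    """Stand-in for a model: 'predicts' the next k consecutive integers."""
    last = context[-1]
    return [last + i + 1 for i in range(k)]


if __name__ == "__main__":
    for k in (1, 4):
        tokens, passes = decode(toy_step, prompt=[0], num_tokens=10, k=k)
        print(f"k={k}: {passes} passes, tokens={tokens}")
```

With k=4, ten tokens take 3 passes instead of 10. Pass count alone puts a ceiling of 4x on the gain at k=4, and the reported 2.4-3.5x speedup sits below that ceiling.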