Inference · arXiv cs.AI · 21 h ago

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

K-Forcing is a push-forward language modeling paradigm for joint next-k-token decoding, intended to make autoregressive (AR) models more efficient in high-load batch serving. It distills an existing AR model into a mapping that generates multiple future tokens in a single forward pass, while keeping a standard causal Transformer architecture. Configured to generate 4 tokens at a time, it achieves a speedup of approximately 2.4-3.5x, which targets the growing inference cost of deploying large language models.
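To make the forward-pass arithmetic concrete, here is a minimal sketch of a decoding loop that emits k tokens per call. It is not the paper's implementation: `decode`, `toy_step`, and `forward_passes` are hypothetical names, and `toy_step` is a stand-in that returns consecutive integers instead of running a model. The sketch only illustrates that emitting k tokens per pass cuts the number of passes to ceil(N / k).

```python
import math


def forward_passes(num_tokens: int, k: int) -> int:
    """Passes needed to emit num_tokens when each pass yields up to k tokens."""
    return math.ceil(num_tokens / k)


def decode(step_fn, prompt, num_tokens, k):
    """Generic loop: step_fn(context, n) returns n new tokens (hypothetical API)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_tokens:
        remaining = num_tokens - (len(out) - len(prompt))
        out.extend(step_fn(out, min(k, remaining)))
        passes += 1  # one call stands in for one forward pass
    return out[len(prompt):], passes


def toy_step(context, n):
    """Stand-in for a model: emits the next n consecutive integers."""
    last = context[-1]
    return [last + i + 1 for i in range(n)]


# k = 1 is ordinary one-token-at-a-time AR decoding; k = 4 matches the summary's setting.
tokens_ar, passes_ar = decode(toy_step, [0], 10, 1)
tokens_k4, passes_k4 = decode(toy_step, [0], 10, 4)
print(passes_ar, passes_k4)
```

With k = 4 the pass count drops by at most 4x, so the reported 2.4-3.5x speedup sits below that bound, which would only be reached if a 4-token pass cost the same as a single-token pass.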

Tags: language modeling, decoding, efficiency
