Inference · arXiv cs.AI — 21 h ago

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

The paper introduces CLP (Collocation-Length Predictor), an approach to multi-token prediction (MTP) in large language models that mitigates head-backbone competition during autoregressive decoding. CLP adds a lightweight span-level decision layer with only 4.6K–7.7K parameters and reports speedups of 1.20x–1.29x on Qwen2.5 1.5B models and 1.14x–1.20x on 7B models without quality degradation (repetition ratio < 0.02), whereas prior gate-based methods showed significant quality loss. The work also provides a roadmap for improving MTP head prediction accuracy, which is critical for accelerating inference in large-scale models.
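As a rough illustration of what a span-level length decision could look like during decoding, the sketch below assumes a predictor that, at each step, outputs how many of the tokens drafted by the MTP heads to keep. The names `accept_span`, `decode`, and `tokens_per_step` are hypothetical, the drafted tokens and predicted lengths are made-up inputs, and nothing here is taken from the paper's implementation.

```python
from typing import List, Sequence


def accept_span(drafted: Sequence[int], predicted_len: int) -> List[int]:
    """Keep the first `predicted_len` drafted tokens, but always at least one.

    `drafted` holds the tokens proposed in one forward pass (backbone token
    first, then MTP-head tokens). `predicted_len` is the span length chosen
    by a hypothetical length predictor.
    """
    k = max(1, min(predicted_len, len(drafted)))
    return list(drafted[:k])


def decode(draft_steps: Sequence[Sequence[int]],
           length_preds: Sequence[int]) -> List[int]:
    """Concatenate the accepted span from every decoding step."""
    output: List[int] = []
    for drafted, k in zip(draft_steps, length_preds):
        output.extend(accept_span(drafted, k))
    return output


def tokens_per_step(length_preds: Sequence[int], max_len: int) -> float:
    """Average number of tokens emitted per forward pass."""
    clipped = [max(1, min(k, max_len)) for k in length_preds]
    return sum(clipped) / len(clipped)


# Toy inputs (assumed, not from the paper): four steps, three drafted tokens each.
draft_steps = [[11, 12, 13], [21, 22, 23], [31, 32, 33], [41, 42, 43]]
length_preds = [1, 2, 1, 1]

print(decode(draft_steps, length_preds))
print(tokens_per_step(length_preds, max_len=3))
```

Tokens per forward pass is only an upper bound on wall-clock speedup, since the extra heads and the decision layer add their own cost; the reported 1.14x–1.29x figures are measured results from the paper, not something this toy computes.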

multi-token · inference · language models · relevance 0.00 · engagement 0.00
