Training
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
The paper introduces a continual-training recipe that converts a dense large language model (LLM) into a channel-sparse one, starting from Qwen2.5-8B. During the 32K-context training phase, it uses a predictor-gated sparse SwiGLU feedforward network (FFN): a low-rank predictor scores the FFN channels for routing, and a bank-wise top-k rule selects which channels stay active, giving 4x sparsity. Because the sparsity is optimized during training rather than imposed post hoc, the recipe is relevant to practitioners who want more efficient models without giving up performance.
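To make the routing idea concrete, here is a minimal single-token sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: it assumes that "bank-wise top-k" means splitting the FFN channels into equal contiguous banks and keeping the `k` highest-scoring channels in each bank, and that the low-rank predictor is a product of two small matrices. All names (`bankwise_topk_mask`, `sparse_swiglu`, `pred_a`, `pred_b`, `bank_size`) are hypothetical, and how the predictor is trained is not covered.

```python
import math


def bankwise_topk_mask(scores, bank_size, k):
    """Return a 0/1 mask that keeps the k highest-scoring channels in each bank."""
    if len(scores) % bank_size != 0:
        raise ValueError("number of channels must be a multiple of bank_size")
    if not 0 < k <= bank_size:
        raise ValueError("k must be between 1 and bank_size")
    mask = [0] * len(scores)
    for start in range(0, len(scores), bank_size):
        bank = range(start, start + bank_size)
        # Highest score first; ties go to the lower channel index.
        for i in sorted(bank, key=lambda i: (-scores[i], i))[:k]:
            mask[i] = 1
    return mask


def matvec(x, w):
    """Compute x @ w for a vector x of length n and a matrix w with n rows."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def silu(v):
    """SiLU (swish) activation used inside SwiGLU."""
    return v / (1.0 + math.exp(-v))


def sparse_swiglu(x, w_gate, w_up, w_down, pred_a, pred_b, bank_size, k):
    """Predictor-gated SwiGLU FFN for one token (assumed formulation).

    Shapes: w_gate, w_up are d_model x d_ff; w_down is d_ff x d_model;
    pred_a is d_model x r and pred_b is r x d_ff, with r much smaller than d_ff.
    """
    # The low-rank predictor produces one routing score per FFN channel.
    scores = matvec(matvec(x, pred_a), pred_b)
    mask = bankwise_topk_mask(scores, bank_size, k)
    out = [0.0] * len(w_down[0])
    for j, keep in enumerate(mask):
        if not keep:
            continue  # skipped channels cost no gate/up/down work
        gate = sum(xi * row[j] for xi, row in zip(x, w_gate))
        up = sum(xi * row[j] for xi, row in zip(x, w_up))
        h = silu(gate) * up
        for c, w in enumerate(w_down[j]):
            out[c] += h * w
    return out, mask
```

With `bank_size=4` and `k=1`, one channel in every bank of four is kept, which matches the 4x figure if that figure is read as a quarter of the FFN channels being active per token. Fixing the count per bank, rather than taking a global top-k, keeps the number of active channels identical across banks.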
sparsity llm training