Training
SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling
SVD-Partitioned Residual Initialization (SPRI) is a method for upcycling pretrained dense models into sparse Mixture-of-Experts (MoE) models when adaptation data is limited. SPRI initializes experts with SVD-partitioned residuals of the pretrained feed-forward network (FFN) weights, which increases expert diversity while preserving the pretrained weight structure, and pairs this initialization with a two-stage training strategy to stabilize adaptation. On multilingual speech-to-text translation with the CoVoST2 dataset, SPRI improves BLEU and COMET scores over fully fine-tuned dense models and outperforms previous MoE upcycling methods.
moe, upcycling, training
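The summary above does not spell out how the SVD partitions are formed or how the residuals are scaled, so the following is only a rough sketch of one possible reading, not the authors' implementation. It assumes that the singular triplets of a pretrained FFN weight matrix are split into disjoint groups, one per expert, and that each expert starts from the pretrained weights plus a small multiple of its own group's low-rank component. The function name `svd_partitioned_experts`, the interleaved partitioning rule, and the `alpha` scale are all hypothetical.

```python
import numpy as np


def svd_partitioned_experts(W, num_experts, alpha=0.1):
    """Illustrative only: derive diverse expert weights from one dense matrix W.

    Assumption (not from the source): expert e receives W plus alpha times
    the low-rank component built from its own disjoint slice of singular
    triplets of W.
    """
    # Thin SVD: W = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    rank = S.shape[0]

    experts = []
    for e in range(num_experts):
        # Interleave indices so each expert gets both strong and weak directions
        # (a hypothetical choice; contiguous blocks would also be a partition).
        idx = np.arange(e, rank, num_experts)
        # Low-rank component spanned by this expert's singular triplets.
        component = (U[:, idx] * S[idx]) @ Vt[idx, :]
        # Stay close to the pretrained weights while differing per expert.
        experts.append(W + alpha * component)
    return experts


# Usage: a small random matrix stands in for a pretrained FFN weight.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
experts = svd_partitioned_experts(W, num_experts=3, alpha=0.1)
```

Under these assumptions the per-expert components are disjoint and sum back to `W`, so every expert remains a small perturbation of the pretrained weights while no two experts are identical. The two-stage training strategy mentioned in the summary is not sketched here because the summary gives no detail about its stages.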