Models · arXiv cs.AI · 10 d ago

Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

The paper introduces Expert Tying, an architectural modification for Mixture-of-Experts (MoE) language models that shares expert parameters across consecutive transformer layers while keeping routing and attention independent in each layer. Evaluated on architectures such as OLMoE, Qwen3, and DeepSeek, the method can nearly halve the memory footprint without significant degradation in perplexity or downstream performance. For practitioners, this offers a better compute-to-memory trade-off when training and scaling large language models.
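
To make the memory claim concrete, here is a rough parameter-count sketch in Python. It is not the paper's code: the function name `moe_param_count`, the `tie_group` argument, the two-matrix expert shape, and all dimensions are hypothetical choices for illustration. It assumes one expert bank is shared by each group of consecutive layers while every layer keeps its own router and attention weights, and it ignores embeddings, norms, and biases.

```python
def moe_param_count(n_layers, d_model, d_ff, n_experts, tie_group=1):
    """Count weights in a toy MoE stack (illustrative assumptions only).

    tie_group=1 means every layer owns its experts (untied);
    tie_group=2 means each pair of consecutive layers shares one expert bank.
    """
    attn = 4 * d_model * d_model                  # Q, K, V, O projections, per layer
    router = d_model * n_experts                  # one routing matrix, per layer
    expert_bank = n_experts * 2 * d_model * d_ff  # up + down projection per expert
    n_banks = -(-n_layers // tie_group)           # ceiling division
    # Attention and routers stay per layer; only expert banks are shared.
    return n_layers * (attn + router) + n_banks * expert_bank


# Hypothetical configuration, not taken from the paper.
untied = moe_param_count(n_layers=16, d_model=1024, d_ff=512, n_experts=64)
tied = moe_param_count(n_layers=16, d_model=1024, d_ff=512, n_experts=64, tie_group=2)
ratio = tied / untied
print(f"untied={untied:,} tied={tied:,} ratio={ratio:.3f}")
```

Under these assumptions the ratio lands a little above one half rather than exactly at it, because the untied attention and router weights are still paid for in every layer. That is consistent with the summary's "nearly halve" wording.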

mixture-of-experts · large language models · expert tying
