Training
Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
The article presents SParse Expert Synchronization (SPES), a decentralized framework for pretraining mixture-of-experts (MoE) large language models (LLMs) that reduces per-node memory usage by training only a subset of experts on each node. Using 16 standalone 48 GB GPUs, SPES trains a 2-billion-parameter MoE model whose performance is competitive with that of centrally trained models. The approach also scales to a 7-billion-parameter model and to a 9-billion-parameter model derived from a dense checkpoint. For practitioners, this means large models can be pretrained in distributed environments without extensive centralized resources.
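The summary does not spell out the synchronization rule, so the following is only a minimal sketch of the general idea, assuming that each node holds a fixed subset of experts and that a sync step averages each expert over the nodes that trained it. The function names `assign_experts` and `sync_experts`, the round-robin assignment, and the plain averaging are illustrative assumptions, not the method described in the article.

```python
from typing import Dict, List

Params = List[float]  # toy stand-in for one expert's flattened weights


def assign_experts(num_experts: int, num_nodes: int, per_node: int) -> List[List[int]]:
    """Hypothetical round-robin assignment: each node trains `per_node` experts."""
    return [
        [(node * per_node + k) % num_experts for k in range(per_node)]
        for node in range(num_nodes)
    ]


def sync_experts(node_states: List[Dict[int, Params]]) -> Dict[int, Params]:
    """Average each expert element-wise over the nodes that hold a copy of it.

    Assumption: plain averaging. Nodes that never trained an expert
    contribute nothing to it, so they never need to store its weights.
    """
    copies: Dict[int, List[Params]] = {}
    for state in node_states:
        for expert_id, params in state.items():
            copies.setdefault(expert_id, []).append(params)
    return {
        expert_id: [sum(values) / len(values) for values in zip(*replicas)]
        for expert_id, replicas in copies.items()
    }


if __name__ == "__main__":
    # 8 experts spread over 4 nodes, 2 experts per node (no overlap).
    print(assign_experts(num_experts=8, num_nodes=4, per_node=2))
    # Expert 0 is held by both nodes and averaged; expert 1 only by the second.
    print(sync_experts([{0: [1.0, 2.0]}, {0: [3.0, 4.0], 1: [5.0, 5.0]}]))
```

In this toy setup a node stores `per_node` of the `num_experts` experts rather than all of them, which is where the per-node memory saving would come from under these assumptions.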
pretraining, distributed-gpus, memory-efficient