Training · arXiv cs.CL · 12 d ago

Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

The paper presents SParse Expert Synchronization (SPES), a decentralized framework for pretraining mixture-of-experts (MoE) large language models (LLMs) that significantly reduces memory usage by training only a subset of experts on each node. Using 16 standalone 48 GB GPUs, SPES trains a 2-billion-parameter MoE model that performs competitively with centrally trained models. The authors also demonstrate scalability with a 7-billion-parameter model and with a 9-billion-parameter model derived from a dense checkpoint. For practitioners, the approach makes it possible to train large models across distributed hardware without extensive centralized resources.
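
To make the memory argument concrete, here is a minimal sketch of the bookkeeping behind "training only a subset of experts per node". It is an illustration under stated assumptions, not the SPES algorithm itself: the round-robin assignment, the function names `assign_experts` and `trainable_params_per_node`, the assumption that shared (non-expert) weights are trained on every node, and all parameter counts are hypothetical. Only the figure of 16 nodes comes from the summary above.

```python
def assign_experts(num_experts: int, num_nodes: int) -> list[list[int]]:
    """Split expert indices across nodes round-robin.

    Hypothetical assignment scheme; the paper may partition experts differently.
    """
    return [list(range(node, num_experts, num_nodes)) for node in range(num_nodes)]


def trainable_params_per_node(
    num_experts: int,
    num_nodes: int,
    params_per_expert: int,
    shared_params: int,
) -> list[int]:
    """Count the parameters each node would need gradients and optimizer state for.

    Assumption: every node trains the shared (non-expert) weights plus only
    the experts assigned to it; all other experts are treated as frozen there.
    """
    assignment = assign_experts(num_experts, num_nodes)
    return [shared_params + len(owned) * params_per_expert for owned in assignment]


# Hypothetical sizes chosen only so the total is 2 billion parameters.
NUM_EXPERTS = 64
NUM_NODES = 16  # matches the 16 GPUs mentioned in the summary
PARAMS_PER_EXPERT = 25_000_000
SHARED_PARAMS = 400_000_000

total_params = SHARED_PARAMS + NUM_EXPERTS * PARAMS_PER_EXPERT
per_node = trainable_params_per_node(
    NUM_EXPERTS, NUM_NODES, PARAMS_PER_EXPERT, SHARED_PARAMS
)
# Fraction of the full model that a single node actually updates.
fraction_trained = per_node[0] / total_params
```

With these made-up sizes each node updates 500 million of the 2 billion parameters, so gradient and optimizer-state memory would scale with roughly a quarter of the model rather than all of it. The real savings depend on the model's actual expert and shared parameter counts, which the summary does not give.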

pretraining · distributed-gpus · memory-efficient
