Training
Small Initialization Matters for Large Language Models
This paper studies how parameter initialization affects the training of large language models (LLMs) and finds that smaller initialization scales improve pretraining performance, particularly on reasoning tasks. It proposes a $\gamma$-initialization rule that makes small initialization the default setting. According to the authors, small initialization guides parameters along a developmental trajectory from low-complexity structures to richer representations, which improves model capacity and reasoning ability. Because the initialization scale is a setting that costs little to change, the result is directly relevant to practitioners concerned with model efficiency and effectiveness.
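The summary does not spell out the exact form of the $\gamma$-initialization rule. As a rough illustration only, the sketch below assumes one plausible parameterization: weights drawn from a zero-mean Gaussian with standard deviation `fan_in ** (-gamma)`, so that `gamma = 0.5` matches the familiar `1/sqrt(fan_in)` scaling and larger `gamma` gives a smaller initialization. The function names (`gamma_init_std`, `init_weight`, `empirical_std`) and this formula are assumptions made for the example, not the paper's confirmed definition.

```python
import math
import random


def gamma_init_std(fan_in, gamma):
    """Return fan_in ** (-gamma), an ASSUMED form of the initialization scale."""
    if fan_in <= 0:
        raise ValueError("fan_in must be positive")
    return fan_in ** (-gamma)


def init_weight(fan_in, fan_out, gamma, seed=0):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2) as nested lists."""
    rng = random.Random(seed)  # local RNG so results are reproducible
    std = gamma_init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def empirical_std(matrix):
    """Population standard deviation over all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


fan_in = 1024
standard = gamma_init_std(fan_in, 0.5)  # 1/sqrt(fan_in) scaling
small = gamma_init_std(fan_in, 1.0)     # smaller scale under the assumed rule

print(f"gamma=0.5 std: {standard:.6f}")
print(f"gamma=1.0 std: {small:.6f}")
```

Under this assumed form, moving `gamma` from 0.5 to 1.0 at `fan_in = 1024` shrinks the standard deviation by a factor of `sqrt(1024) = 32`, which gives a sense of how strongly the exponent controls the initialization scale.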
initialization, large language models, capacity