Research
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
The study introduces module-wise weight-space geometry for transformer optimization and analyzes how different manifold constraints affect GPT-2 pretraining. Applying Stiefel geometry to the attention layers and DGram geometry to the MLP layers performs best, whereas applying one constraint uniformly to every module leads to instability caused by singular value growth in the attention weights. The results indicate that the optimization geometry should be chosen per transformer module rather than once for the whole model.
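To make the Stiefel constraint concrete, here is a minimal NumPy sketch that assigns a constraint per module. It is not the study's implementation. It assumes the textbook definition of the Stiefel manifold (matrices with orthonormal columns or rows, so every singular value equals 1) and uses an SVD-based projection, which is one common choice among several. The function names and the GPT-2-style parameter names are hypothetical, and because the summary does not define DGram geometry, the MLP weights are left untouched here.

```python
import numpy as np


def project_to_stiefel(w: np.ndarray) -> np.ndarray:
    """Project w onto the nearest matrix with orthonormal columns (or rows).

    With the thin SVD w = U diag(s) Vt, the polar factor U @ Vt keeps the
    singular vectors and sets every singular value to 1.
    """
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt


def max_singular_value(w: np.ndarray) -> float:
    """Largest singular value, the quantity said to grow in attention weights."""
    return float(np.linalg.svd(w, compute_uv=False)[0])


def constrain_by_module(params: dict) -> dict:
    """Apply a different constraint depending on the module a weight belongs to.

    Assumption: attention weights are identified by ".attn." in their name.
    MLP weights pass through unchanged because DGram is not specified here.
    """
    constrained = {}
    for name, w in params.items():
        if ".attn." in name:
            constrained[name] = project_to_stiefel(w)
        else:
            constrained[name] = w
    return constrained


# Hypothetical parameter names and shapes, for illustration only.
rng = np.random.default_rng(0)
params = {
    "h.0.attn.c_proj.weight": 3.0 * rng.standard_normal((8, 8)),
    "h.0.mlp.c_fc.weight": rng.standard_normal((8, 32)),
}
constrained = constrain_by_module(params)

for name in params:
    before = max_singular_value(params[name])
    after = max_singular_value(constrained[name])
    print(f"{name}: max singular value {before:.3f} -> {after:.3f}")
```

In a training loop, a projection like this would typically run after each optimizer step on the constrained modules only, which keeps the attention singular values pinned at 1 while other modules follow their own rule.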
transformers, optimization, weight-space-geometry