Research · arXiv cs.AI · 7 d ago

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

The study analyzes module-wise weight-space geometry in transformer optimization, comparing the effects of different manifold constraints on GPT-2 pretraining. It reports that constraining attention layers with Stiefel geometry and MLP layers with DGram geometry performs best, whereas applying one constraint uniformly to all modules leads to instability driven by singular value growth in the attention weights. The results indicate that manifold constraints should be chosen per module rather than applied uniformly across the transformer.
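To make the Stiefel constraint concrete: a matrix W lies on the Stiefel manifold when its columns are orthonormal (W^T W = I), which pins every singular value at 1, so they cannot grow. The summary does not describe the paper's actual update rule, so the following is only a minimal sketch of one common way to return a matrix to that manifold, a QR-style retraction done here with modified Gram-Schmidt. The function names and matrix sizes are hypothetical, and DGram geometry is not covered because the summary does not define it.

```python
import math
import random


def qr_retract(columns):
    """Orthonormalize a list of column vectors with modified Gram-Schmidt.

    The result has orthonormal columns, i.e. it is a point on the Stiefel
    manifold. Assumes the input columns are linearly independent.
    """
    basis = []
    for col in columns:
        v = list(col)
        # Remove the components along the columns already orthonormalized.
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis


def orthogonality_error(columns):
    """Largest absolute entry of W^T W - I; zero exactly on the manifold."""
    p = len(columns)
    return max(
        abs(sum(a * b for a, b in zip(columns[i], columns[j]))
            - (1.0 if i == j else 0.0))
        for i in range(p)
        for j in range(p)
    )


rng = random.Random(0)
n, p = 8, 4  # hypothetical sizes: p columns, each of length n

# Stand-in for a weight matrix after an unconstrained gradient step.
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]

# Retract back onto the Stiefel manifold.
W_stiefel = qr_retract(W)

print("error before retraction:", orthogonality_error(W))
print("error after retraction: ", orthogonality_error(W_stiefel))
```

The sketch is kept dependency-free for clarity. In practice one would more likely use a library QR or SVD routine, and the retraction would be applied only to the modules chosen for this constraint, such as the attention weights in the configuration the summary describes.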

transformers · optimization · weight-space-geometry
