Training
Decentralized Autoregressive Generation
This work establishes the theoretical equivalence between decentralized and centralized training for autoregressive generation, using the Discrete Flow Matching framework to show that a global model can be decomposed into independent experts. Experiments across multimodal benchmarks show that decentralized training performs competitively with centralized training. For practitioners, this points to a viable way to scale autoregressive models while maintaining efficiency and performance.
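To make the decomposition claim concrete, the sketch below checks a simpler, analogous identity on a toy distribution: a next-token conditional fit to all the data equals a router-weighted mixture of conditionals fit to each data partition separately. This is not the paper's derivation, which is carried out in the Discrete Flow Matching framework. The sizes `K`, `C`, `V`, the random toy joint distribution, and every function name are assumptions made up for illustration.

```python
import random

random.seed(0)
K, C, V = 3, 4, 5  # hypothetical sizes: experts (partitions), contexts, vocabulary

# Toy joint distribution P(k, c, x) over partition k, context c, next token x.
joint = [[[random.random() + 0.01 for _ in range(V)] for _ in range(C)]
         for _ in range(K)]
total = sum(p for part in joint for row in part for p in row)
joint = [[[p / total for p in row] for row in part] for part in joint]


def context_mass(c):
    """P(c): marginal probability of context c across all partitions."""
    return sum(sum(joint[k][c]) for k in range(K))


def global_cond(c, x):
    """Centralized model: P(x | c) computed from all the data."""
    return sum(joint[k][c][x] for k in range(K)) / context_mass(c)


def expert_cond(k, c, x):
    """Expert k: P_k(x | c), computed from partition k only."""
    return joint[k][c][x] / sum(joint[k][c])


def router(k, c):
    """Router weight P(k | c): posterior over partitions given the context."""
    return sum(joint[k][c]) / context_mass(c)


def mixture_cond(c, x):
    """Recombined experts: sum over k of P(k | c) * P_k(x | c)."""
    return sum(router(k, c) * expert_cond(k, c, x) for k in range(K))


# Largest disagreement between the centralized and recombined conditionals.
max_gap = max(abs(mixture_cond(c, x) - global_cond(c, x))
              for c in range(C) for x in range(V))
```

In this toy setting `max_gap` is zero up to floating-point error, because the identity is just the law of total probability. In practice the experts and the router would be learned models, so the match would only be approximate.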
decentralized, autoregressive, training