Training
Improved Baselines with Representation Autoencoders
The paper presents Representation Autoencoders version 2 (RAEv2), which improves on the original RAE in two ways: it adopts a generalized formulation that aggregates features from multiple encoder layers to improve reconstruction, and it shows that RAE and Representation Alignment (REPA) interact in complementary ways. RAEv2 reaches a state-of-the-art generation Fréchet Inception Distance (gFID) of 1.06 on ImageNet-256 in just 80 epochs, converging significantly faster than the original RAE. The paper also introduces a new training-efficiency metric, EPFID@k. The results are most relevant to practitioners who care about training efficiency and generation quality in text-to-image generation and other applications built on pretrained vision encoders.
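The summary does not say how RAEv2 combines encoder layers, so the following is only a rough illustration of what "aggregating multiple encoder layers" could mean in the simplest case: a softmax-weighted sum of per-layer feature vectors. The function names `softmax` and `aggregate_layers`, the per-layer scores, and the weighting rule itself are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Combine one feature vector per encoder layer into a single vector.

    ASSUMPTION: a softmax-weighted sum is used here purely for illustration;
    the paper's actual aggregation rule is not given in this summary.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    dim = len(layer_features[0])
    if any(len(f) != dim for f in layer_features):
        raise ValueError("all layers must share the same feature dimension")
    weights = softmax(layer_scores)
    return [
        sum(w * f[i] for w, f in zip(weights, layer_features))
        for i in range(dim)
    ]


# Toy features from three hypothetical encoder layers, equally weighted.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent = aggregate_layers(features, [0.0, 0.0, 0.0])
```

With a single layer the function returns that layer's features unchanged, so under this assumed scheme a last-layer-only encoder is the special case of the multi-layer form.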
autoencoders, representation, improvement