Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning
This article introduces Momentum-Guided Semantic Forecasting (MoFore), a framework for self-supervised video representation learning. Rather than relying on pixel-level reconstruction or semantic alignment, MoFore forecasts future latent embeddings from temporally distant context clips. Its key innovations are randomized temporal-gap forecasting, which improves robustness across temporal scales, and an objective that combines predictive latent forecasting with contrastive regularization to enforce temporal consistency. Experiments on the UCF101 dataset indicate that MoFore learns temporally consistent and semantically meaningful representations without action labels, which suggests its potential for efficient video representation learning.
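The following is a minimal sketch of how the components named above could fit together. It is not the authors' implementation, and the summary does not specify these details, so every choice here is an assumption made for illustration: a uniform distribution over the temporal gap, cosine distance as the forecasting loss, InfoNCE as the contrastive regularizer, an exponential-moving-average (EMA) update as the "momentum" mechanism, and all function names, parameters, and default values. Embeddings are plain Python lists so that the example runs with the standard library alone.

```python
import math
import random


def sample_clip_pair(num_frames, clip_len, min_gap, max_gap, rng):
    """Pick start frames for a context clip and a later target clip.

    Assumption: the gap between the two clips is drawn uniformly from
    [min_gap, max_gap]; the source only says the gap is randomized.
    """
    gap = rng.randint(min_gap, max_gap)  # inclusive on both ends
    last_context_start = num_frames - (2 * clip_len + gap)
    if last_context_start < 0:
        raise ValueError("video too short for this clip length and gap")
    context_start = rng.randint(0, last_context_start)
    target_start = context_start + clip_len + gap
    return context_start, target_start, gap


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forecasting_loss(predicted, target):
    """Assumed forecasting term: cosine distance in latent space."""
    return 1.0 - dot(l2_normalize(predicted), l2_normalize(target))


def info_nce(predicted, targets, positive_index, temperature=0.1):
    """Assumed contrastive term: the matching target is the positive,
    all other targets in the batch act as negatives."""
    p = l2_normalize(predicted)
    logits = [dot(p, l2_normalize(t)) / temperature for t in targets]
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[positive_index]


def total_loss(predicted_batch, target_batch, contrastive_weight=0.5):
    """Forecasting loss plus weighted contrastive regularization."""
    n = len(predicted_batch)
    forecast = sum(
        forecasting_loss(p, t) for p, t in zip(predicted_batch, target_batch)
    ) / n
    contrast = sum(
        info_nce(p, target_batch, i) for i, p in enumerate(predicted_batch)
    ) / n
    return forecast + contrastive_weight * contrast


def ema_update(target_params, online_params, momentum=0.99):
    """Assumed momentum mechanism: the target encoder's parameters track
    an exponential moving average of the online encoder's parameters."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]
```

In a training loop built on this sketch, each step would sample a clip pair with a fresh gap, encode the context clip with the online encoder and the target clip with the momentum encoder, minimize `total_loss` with respect to the online encoder and predictor only, and then call `ema_update` on the target encoder's parameters.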