Research
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
The article introduces ThinkJEPA, a framework that couples a Vision-Language Model (VLM) with a latent world model to improve long-horizon semantic reasoning when forecasting future states from video observations. It uses a dual-temporal pathway: a dense JEPA branch captures fine-grained motion cues, while a VLM "thinker" branch supplies broader semantic guidance. The pathway is supported by a hierarchical representation extraction module. In experiments on hand-manipulation trajectory prediction, ThinkJEPA outperforms both VLM-only and JEPA-predictor baselines, which suggests that it makes long-horizon predictions more robust.
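One way to picture the dual-temporal pathway is as two different views of the same clip. The sketch below is only an illustration under stated assumptions. The summary does not say how frames are chosen for each branch, so the dense recent-frame window, the uniform clip-wide sampling, and the names `dual_temporal_indices`, `dense_window`, and `sparse_count` are all hypothetical rather than taken from the article.

```python
def dual_temporal_indices(num_frames, dense_window, sparse_count):
    """Pick frame indices for two branches of a dual-temporal pathway.

    Assumption (not from the article): the dense branch sees the most
    recent `dense_window` consecutive frames, and the sparse branch sees
    `sparse_count` frames spread uniformly over the whole clip.
    """
    if num_frames <= 0 or dense_window <= 0 or sparse_count <= 0:
        raise ValueError("all arguments must be positive")

    # Dense branch: consecutive frames at the end of the clip,
    # intended to preserve fine-grained motion cues.
    dense_window = min(dense_window, num_frames)
    dense = list(range(num_frames - dense_window, num_frames))

    # Sparse branch: a few frames covering the full clip,
    # intended to give broader context at a larger temporal stride.
    sparse_count = min(sparse_count, num_frames)
    if sparse_count == 1:
        sparse = [num_frames - 1]
    else:
        # Integer arithmetic keeps the first and last frame exact.
        sparse = [
            (i * (num_frames - 1)) // (sparse_count - 1)
            for i in range(sparse_count)
        ]
    return dense, sparse


if __name__ == "__main__":
    dense, sparse = dual_temporal_indices(64, 16, 8)
    print("dense branch frames:", dense)
    print("sparse branch frames:", sparse)
```

For a 64-frame clip with `dense_window=16` and `sparse_count=8`, the dense branch receives frames 48 through 63 and the sparse branch receives every ninth frame from 0 to 63. A real system would then encode each set with its own model and fuse the results, which this sketch does not attempt.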
latent world models, vision-language models