Research · arXiv cs.AI · 15 d ago

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

The paper introduces ThinkJEPA, a framework that couples a vision-language model (VLM) with a latent world model to improve long-horizon semantic reasoning when forecasting future states from video observations. It uses a dual-temporal pathway: a dense JEPA branch captures fine-grained motion cues, while a VLM "thinker" branch supplies broader semantic guidance. The pathway is supported by a hierarchical representation extraction module. In the reported experiments on hand-manipulation trajectory prediction, ThinkJEPA outperforms both VLM-only and JEPA-predictor baselines, which suggests it may make long-horizon predictions more robust.
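
To make the dual-temporal idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes, for illustration only, that the dense branch reads the most recent frames at stride 1, that the thinker branch reads the whole clip at a larger stride, and that guidance is fused by a weighted sum. The helper names (`dense_indices`, `sparse_indices`, `mean_pool`, `fuse`) and the toy two-dimensional "features" are hypothetical stand-ins for real encoder outputs.

```python
from typing import List, Sequence

Vector = List[float]


def dense_indices(num_frames: int, window: int) -> List[int]:
    """Frame indices for the dense branch: the last `window` frames, stride 1."""
    return list(range(max(0, num_frames - window), num_frames))


def sparse_indices(num_frames: int, stride: int) -> List[int]:
    """Frame indices for the thinker branch: every `stride`-th frame of the clip."""
    return list(range(0, num_frames, stride))


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse(dense_feat: Vector, guidance: Vector, weight: float = 0.5) -> Vector:
    """Hypothetical fusion rule (an assumption): dense feature plus weighted guidance."""
    return [d + weight * g for d, g in zip(dense_feat, guidance)]


# Toy clip: 16 frames, each with a stand-in feature [t, 1.0] instead of encoder output.
frames = [[float(t), 1.0] for t in range(16)]

dense = [frames[i] for i in dense_indices(len(frames), window=4)]    # frames 12..15
sparse = [frames[i] for i in sparse_indices(len(frames), stride=4)]  # frames 0, 4, 8, 12

# Short-range motion summary conditioned on a long-range semantic summary.
conditioned = fuse(mean_pool(dense), mean_pool(sparse))
```

The point of the sketch is the split in temporal coverage: the dense window sees only recent frames in full detail, while the strided sample spans the entire clip cheaply, so the fused vector carries both local and long-range context.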

latent world models · vision-language models
