Research
Native Active Perception as Reasoning for Omni-Modal Understanding
OmniAgent is a novel omni-modal agent for video understanding. It uses a POMDP-based iterative Observation-Thought-Action cycle that improves efficiency by selectively processing audio-visual cues. Key innovations include Agentic Supervised Fine-Tuning for active perception and Agentic Reinforcement Learning with TAURA, which improves performance through targeted reasoning turns. Empirically, the 7-billion-parameter OmniAgent achieves state-of-the-art results on benchmarks such as LVBench and outperforms the significantly larger Qwen2.5-VL-72B model. This matters for practitioners who want to improve performance on video analysis tasks while limiting computational cost.
video understanding, reasoning, active perception
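
To make the Observation-Thought-Action cycle described above more concrete, here is a minimal toy sketch. It is not OmniAgent's implementation: the `Segment` type, the `perceive`, `think`, and `run_agent` functions, and the keyword-matching "policy" are all hypothetical stand-ins chosen for illustration, and a real agent would replace `think` with a learned model. The sketch only shows the general shape of an agent that reads one audio or visual cue per turn and stops as soon as it has enough evidence, rather than processing every cue up front.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one video segment, with text in place of real signals."""
    visual: str
    audio: str


def perceive(segments, index, modality):
    """Action: read a single cue (one modality of one segment)."""
    segment = segments[index]
    return segment.visual if modality == "visual" else segment.audio


def think(keyword, observations, candidates):
    """Thought: decide whether to answer or which cue to perceive next.

    Assumption: a trivial keyword match stands in for a learned policy.
    """
    for index, _modality, text in observations:
        if keyword in text:
            return "answer", index
    if candidates:
        return "perceive", candidates[0]
    return "answer", None  # nothing left to inspect, no evidence found


def run_agent(segments, keyword, max_turns=8):
    """Run the loop; return (answer_index, number_of_cues_perceived)."""
    candidates = [(i, m) for i in range(len(segments)) for m in ("audio", "visual")]
    observations = []  # everything perceived so far (the agent's partial view)
    for _ in range(max_turns):
        action, arg = think(keyword, observations, candidates)
        if action == "answer":
            return arg, len(observations)
        candidates.remove(arg)
        index, modality = arg
        observations.append((index, modality, perceive(segments, index, modality)))
    # Turn budget exhausted: answer from whatever has been observed.
    _, answer = think(keyword, observations, [])
    return answer, len(observations)


if __name__ == "__main__":
    video = [
        Segment(visual="a crowd walks", audio="street noise"),
        Segment(visual="a dog runs", audio="a bell rings"),
        Segment(visual="empty room", audio="silence"),
    ]
    print(run_agent(video, "bell"))
```

In this toy run the agent answers after perceiving three of the six available cues, which is the kind of saving selective perception is meant to provide. The fixed audio-first scan order is an arbitrary choice for the sketch.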