Agents
ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering
ScoutVLA is a Vision-Language-Action model for aerial Embodied Question Answering (EQA). Its dual-expert architecture separates semantic reasoning from action generation. The model is accompanied by a fine-grained active perception benchmark (FG-EQA) with over 40,000 simulated and 1,000 real-world trajectories. Compared to existing systems, it reports a 10.48× higher average strict success rate and a 7.72× higher average QA correctness. Its continuous viewpoint refinement and decoupled training strategy are relevant to practitioners working to improve UAV capabilities in dynamic environments.
uav, question answering, active perception