Multimodal
See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View
The article introduces UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage for UAVs, so that a UAV's ability to ground a visible target and execute 3D motion toward it can be evaluated more precisely. It also presents 3DG-VLN, a vision-language waypoint prediction framework that uses dynamic 3D direction cues and processes high-resolution front and downward views to improve visual grounding and spatial alignment; it reports a 13.82% improvement in success rate over existing UAV-VLN baselines. For practitioners, the work provides a dedicated benchmark and source code for developing more accurate UAV navigation systems.
uav, vision-language, navigation