Agents
Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
The paper introduces Semantic Flip, a framework that generates synthetic out-of-distribution (OOD) samples to improve the refusal behavior of embodied vision-language models (VLMs) in tasks such as Embodied Question Answering and spatial localization. It transforms queries and video memory into OOD pairs, which are used to train a lightweight rejection module that can be integrated into existing VLM pipelines without retraining. The approach reports superior performance on two benchmarks, including an F1 score of 0.9559 on the newly introduced SpaceReject benchmark, which suggests it can make embodied agents more reliable in real-world applications.
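To make the "rejection module in front of an unchanged VLM" idea concrete, here is a minimal sketch. It is not the paper's implementation: the names `GatedPipeline`, `toy_ood_score`, and `toy_answer`, the text-caption stand-in for video memory, the word-overlap scorer, and the 0.5 threshold are all assumptions made for illustration. The sketch also shows how an F1 score over refusal decisions is computed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GatedPipeline:
    """Wraps a frozen answerer with a separate refusal gate (hypothetical design)."""
    score_ood: Callable[[str, Sequence[str]], float]  # higher = more likely OOD
    answer: Callable[[str, Sequence[str]], str]       # stands in for the VLM
    threshold: float = 0.5                            # assumed cut-off

    def __call__(self, query: str, memory: Sequence[str]) -> str:
        # Refuse before the answerer is ever called; the answerer is untouched.
        if self.score_ood(query, memory) >= self.threshold:
            return "REFUSE"
        return self.answer(query, memory)


def toy_ood_score(query: str, memory: Sequence[str]) -> float:
    """Toy scorer: fraction of query words never seen in the memory captions.

    A trained rejection module would replace this heuristic.
    """
    words = query.lower().split()
    seen = {w for caption in memory for w in caption.lower().split()}
    missing = [w for w in words if w not in seen]
    return len(missing) / len(words) if words else 1.0


def toy_answer(query: str, memory: Sequence[str]) -> str:
    """Placeholder answerer: returns the first memory caption."""
    return memory[0]


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 for the 'refuse' class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Captions stand in for video memory (an assumption for this sketch).
memory = ["a red mug on the kitchen table", "a sofa in the living room"]
pipeline = GatedPipeline(score_ood=toy_ood_score, answer=toy_answer)

in_dist = pipeline("where is the red mug", memory)
out_dist = pipeline("where is the blue bicycle", memory)
```

Keeping the gate as a separate callable mirrors the summary's point that the rejection step can be attached to an existing pipeline: only `score_ood` would need training, while `answer` stays as it is.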
embodied agents, question answering, spatial reasoning