Multimodal
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
The article introduces GRASP (Grounded Reasoning and Symbolic Planning), a framework that lets robots perform tabletop manipulation tasks from natural-language prompts without extensive training. GRASP combines a pretrained Vision-Language Model (VLM) with a bounding-box detection pipeline to translate language into neuro-symbolic goal states, and it achieves a 73.3% success rate across 90 real-robot trials at varying difficulty levels. According to the article, the approach reduces computational burden and makes robots more adaptable in dynamic environments, which makes it relevant to practitioners building language-conditioned robotic systems.
robotics, vision-language, task planning
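
To make the idea of bounding boxes as symbolic goals concrete, here is a minimal Python sketch. It is not GRASP's actual representation or code: the `Box` type, the `inside` and `left_of` predicates, and the pixel coordinates are all hypothetical, chosen only to show how detector output could be lifted into predicates and compared against a goal state derived from a prompt such as "put the apple in the bowl".

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image pixels (hypothetical detector output)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def inside(a: Box, b: Box) -> bool:
    """True if the center of box a lies within box b (illustrative predicate)."""
    cx, cy = a.center
    return b.x_min <= cx <= b.x_max and b.y_min <= cy <= b.y_max


def left_of(a: Box, b: Box) -> bool:
    """True if box a lies entirely to the left of box b in the image."""
    return a.x_max < b.x_min


def symbolic_state(boxes):
    """Derive the set of predicates that hold for the detected boxes."""
    state = set()
    for a, b in permutations(boxes, 2):
        if inside(a, b):
            state.add(("inside", a.label, b.label))
        if left_of(a, b):
            state.add(("left_of", a.label, b.label))
    return state


def goal_satisfied(goal, boxes):
    """A goal is a set of predicates; it holds if all appear in the scene state."""
    return goal <= symbolic_state(boxes)


# Assumed goal for the prompt "put the apple in the bowl".
goal = {("inside", "apple", "bowl")}

# Made-up detections before and after a grasp-and-place action.
before = [Box("apple", 10, 10, 30, 30), Box("bowl", 100, 100, 200, 200)]
after = [Box("apple", 120, 120, 140, 140), Box("bowl", 100, 100, 200, 200)]

print(goal_satisfied(goal, before))
print(goal_satisfied(goal, after))
```

In a sketch like this, a symbolic planner would search for actions that move the scene from a state where the goal set is not satisfied to one where it is, while the VLM's role would be limited to producing the goal predicates from the prompt.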