3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
The article introduces 3D-RFT, a framework for video-based 3D scene understanding that uses Reinforcement Learning with Verifiable Rewards (RLVR) to optimize a model directly for its evaluation metrics rather than relying on supervised fine-tuning. The 4-billion-parameter model, 3D-RFT-4B, is trained with Group Relative Policy Optimization (GRPO) using task-specific reward functions built on metrics such as 3D IoU and F1-Score. It achieves state-of-the-art results on benchmarks for 3D video detection, visual grounding, and spatial reasoning, outperforming larger models such as VG LLM-8B. For practitioners, the significance is a training objective aligned with how the model is evaluated, which enhances the reasoning capabilities of large models on 3D scene understanding tasks.
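To make the idea of a metric-based, verifiable reward concrete, here is a minimal Python sketch and not the article's implementation. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), uses the raw 3D IoU as the reward, and normalizes rewards within a group of sampled responses the way GRPO-style methods commonly do (reward minus group mean, divided by group standard deviation). The function names, the box format, the use of the population standard deviation, and the `eps` constant are all illustrative assumptions; the actual reward design in 3D-RFT may differ, for example by using oriented boxes.

```python
from statistics import mean, pstdev


def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU for axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Assumption: axis-aligned boxes; oriented boxes need a different routine.
    """
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis, so no overlap at all
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one ground-truth box and three sampled predictions.
ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
predictions = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
    (1.0, 0.0, 0.0, 3.0, 2.0, 2.0),  # shifted by half the box width
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # no overlap
]
rewards = [iou_3d_axis_aligned(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed from the prediction and the ground truth alone, it can be checked automatically, which is what makes it "verifiable". The group normalization means a response is reinforced only to the extent that it beats the other samples drawn for the same prompt.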