Multimodal
Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models
The paper presents a fine-tuning approach for two vision-language models, Gemma 4 12B and Qwen3-VL-8B, that trains on dense coordinate lists to improve visual grounding while keeping structured-output behavior under control. Using high-capacity low-rank adaptation (LoRA), the adaptation raises class-aware F1 from 0.007 to 0.448 and holds the duplicate rate at 0.000. The authors describe the result as a controllable interference surface that lets practitioners manage output quality and structure in vision-language tasks, with the aim of making models more reliable in real-world applications.
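To make the duplicate-rate figure concrete, here is a minimal sketch of one way such a metric could be computed over a predicted coordinate list. The summary does not define the metric, so the `(label, x, y)` point format, the `duplicate_rate` function, and the `tol` parameter are all illustrative assumptions rather than the paper's definition.

```python
from typing import List, Tuple

# Hypothetical prediction format: (class label, x, y). The paper's actual
# output schema is not given in this summary.
Point = Tuple[str, float, float]


def duplicate_rate(points: List[Point], tol: float = 0.0) -> float:
    """Fraction of predictions that repeat an earlier one.

    A prediction counts as a duplicate when an earlier kept prediction has
    the same label and lies within `tol` on both axes (an assumed rule).
    """
    if not points:
        return 0.0
    kept: List[Point] = []
    duplicates = 0
    for label, x, y in points:
        is_dup = any(
            label == k_label and abs(x - kx) <= tol and abs(y - ky) <= tol
            for k_label, kx, ky in kept
        )
        if is_dup:
            duplicates += 1
        else:
            kept.append((label, x, y))
    return duplicates / len(points)
```

Under this assumed definition, a rate of 0.000 means no predicted point repeats an earlier point of the same class.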
vision-language, fine-tuning