Inference
SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
SPOT-E is a test-time method that improves frozen vision-language models (VLMs) by optimizing question-conditioned visual spotlights, using lightweight tuning with Group Relative Policy Optimization (GRPO). Its entropy-shaping objective balances prediction uncertainty over the answer span while preserving high-confidence tokens. The reported result is improved robustness and consistent performance gains across a range of benchmarks and VLM families. For practitioners, this means stronger grounding on evidence-intensive tasks without retraining the underlying model.
visualization, vlm, evidence
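The two ingredients named in the summary, answer-span entropy and GRPO's group-relative scoring, can be illustrated with a minimal sketch. This is not the SPOT-E implementation: the function names are hypothetical, the reward (negative mean span entropy) is a placeholder assumption rather than the paper's entropy-shaping objective, and the use of population standard deviation is an arbitrary choice. It only shows how a group of candidate spotlights could be scored relative to one another once each has a scalar reward.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def span_entropy(span_probs):
    """Mean per-token entropy over an answer span (a list of distributions)."""
    return mean(token_entropy(p) for p in span_probs)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Each reward is centered on the group mean and divided by the group
    standard deviation (population std here, by assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical answer-span distributions produced under three
    # candidate spotlights (toy 3-token vocabulary, 2-token answer span).
    candidates = [
        [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]],
        [[0.40, 0.30, 0.30], [0.50, 0.25, 0.25]],
        [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],
    ]
    # Placeholder reward (assumption): lower span entropy scores higher.
    rewards = [-span_entropy(c) for c in candidates]
    print(group_relative_advantages(rewards))
```

Because the advantages are standardized within the group, only the relative ranking of spotlights matters, which is what lets this kind of tuning run without a learned value model and without updating the frozen VLM's weights.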