Inference · arXiv cs.AI · 4 d ago

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

The paper introduces Adaptive Visual Inference Scaling (AVIS), a method for optimizing inference cost in Vision-Language Models (VLMs) by jointly adjusting Visual Context Scaling (VCS) and Visual Reasoning Scaling (VRS) for each query. AVIS uses Key Diversity Visual (KDV) pruning to remove redundant visual tokens, and adaptive self-consistency to set the number of reasoning rollouts from a learned difficulty predictor. The authors report an improved accuracy-compute trade-off on image and video reasoning benchmarks, with low latency and compatibility with existing VLM architectures.
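The two mechanisms named in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: every function name here is hypothetical, the greedy cosine-similarity selection is an assumed stand-in for KDV pruning, and the linear mapping from difficulty score to rollout count is likewise an assumption made only for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_by_key_diversity(keys: List[Sequence[float]], keep: int) -> List[int]:
    """Assumed stand-in for KDV pruning: greedily keep the visual tokens
    whose key vectors are least similar to the tokens already kept."""
    if keep <= 0:
        return []
    if keep >= len(keys):
        return list(range(len(keys)))
    selected = [0]  # seed with the first token (arbitrary choice)
    # Highest similarity of each token to anything selected so far.
    max_sim = [cosine(k, keys[0]) for k in keys]
    while len(selected) < keep:
        candidates = [i for i in range(len(keys)) if i not in selected]
        nxt = min(candidates, key=lambda i: max_sim[i])
        selected.append(nxt)
        for i, k in enumerate(keys):
            max_sim[i] = max(max_sim[i], cosine(k, keys[nxt]))
    return sorted(selected)


def rollout_budget(difficulty: float, min_rollouts: int = 1,
                   max_rollouts: int = 8) -> int:
    """Map a difficulty score in [0, 1] to a rollout count.
    The linear mapping is an assumption, not taken from the paper."""
    d = min(max(difficulty, 0.0), 1.0)
    return min_rollouts + round(d * (max_rollouts - min_rollouts))


def adaptive_self_consistency(sample_answer: Callable[[], str],
                              difficulty: float) -> Tuple[str, int]:
    """Sample as many answers as the difficulty warrants, then majority-vote."""
    n = rollout_budget(difficulty)
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0], n
```

In this sketch, `sample_answer` would wrap one stochastic VLM generation and `difficulty` would come from the learned predictor; easy queries get a single rollout, hard ones get up to `max_rollouts`.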

vision-language · scaling · adaptive
