Agents
Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition
The paper presents "Divide, Deliberate, Decide," a multi-agent framework for fine-grained egocentric action recognition that runs fully locally and in a zero-shot manner. A Vision-Language Model (VLM) orchestrator segments the video and proposes candidate labels; diverse VLM specialists then deliberate over those candidates, and a Borda count combines their preferences into a final ranking. By exploiting the diversity of the models' priors, the method improves zero-shot performance without fine-tuning, which makes it relevant to practitioners working on action recognition in nuanced visual contexts.
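The final aggregation step can be illustrated with a minimal sketch of a standard Borda count. This is not the paper's implementation: the function name `borda_count`, the example labels, the zero score for labels a specialist leaves unranked, and the alphabetical tie-break are all assumptions made for illustration, since the summary does not specify which Borda variant is used.

```python
def borda_count(rankings):
    """Aggregate ranked label lists (best first) with a standard Borda count.

    Assumptions (not from the paper): each ranking has no duplicate labels,
    a label missing from a specialist's ranking earns 0 points from that
    specialist, and ties are broken alphabetically by label.
    """
    # The candidate set is every label that any specialist ranked.
    candidates = {label for ranking in rankings for label in ranking}
    n = len(candidates)
    scores = {label: 0 for label in candidates}

    for ranking in rankings:
        for position, label in enumerate(ranking):
            # Top position earns n - 1 points, the next n - 2, and so on.
            scores[label] += n - 1 - position

    # Highest total first; alphabetical order resolves equal totals.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from three specialists over three candidate labels.
specialist_rankings = [
    ["cut onion", "peel onion", "wash onion"],
    ["peel onion", "cut onion", "wash onion"],
    ["cut onion", "wash onion", "peel onion"],
]

final_ranking = borda_count(specialist_rankings)
print(final_ranking)
# [('cut onion', 5), ('peel onion', 3), ('wash onion', 1)]
```

A positional scheme like this rewards labels that are consistently ranked near the top across specialists, rather than only counting first-place votes, which is one reason it suits combining models with differing priors.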
action recognition, multi-agent