Agents · arXiv cs.CL · 11 d ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

VisualClaw is a self-evolving multimodal agent designed for real-time interaction with the physical world while addressing key deployment challenges of vision-language models (VLMs). Its hybrid encoding mechanism cuts API costs by up to 98% and improves accuracy, with a peak gain of 15.80% on the EgoSchema benchmark using Gemini 3 Flash. The work also introduces VisualClawArena, a new benchmark for multimodal agents that evaluates an agent's ability to use video evidence and handle dynamic updates. On this benchmark VisualClaw shows improved performance and cost efficiency, which makes it suitable for edge applications where resource optimization is critical.

multimodal · agent · real-time
