Agents
Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
The article introduces Teach VLM, a vision-language model that converts mobile screen trajectories into operational knowledge by analyzing keyframes from demonstration videos. To handle diverse UI designs, it relies on a systematic data acquisition method, and it presents the Teach-and-Repeat paradigm, which uses the extracted knowledge to improve task automation for GUI agents. Extensive evaluations indicate that Teach VLM outperforms existing models in operation semantics prediction and improves Task Success Rates for downstream execution agents in Android environments, which makes it relevant to practitioners building GUI automation systems.
gui agents, operational knowledge, mobile devices