Agents
DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning
The article presents DiffAttn, a diffusion-based framework for predicting drivers' visual attention. It pairs a Swin Transformer encoder with a Feature Fusion Pyramid decoder to model both local and global scene features, and incorporates a large language model (LLM) for top-down semantic reasoning. DiffAttn achieves state-of-the-art performance on four public datasets while improving the interpretability of driver-centric scene understanding and the sensitivity to safety-critical cues. The work has implications for intelligent vehicle systems, where it could improve human-machine interaction and risk perception.
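To make the "diffusion-based" part concrete, here is a minimal sketch of the generic forward noising step that diffusion models are trained to invert, applied to a toy attention map. It is not DiffAttn's implementation: the summary does not state the noise schedule, the number of steps, or how the Swin features and LLM output condition the denoiser, so the linear schedule, the 1000 steps, and every function name below are illustrative assumptions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used default)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_saliency_map(saliency, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [
        [signal_scale * value + noise_scale * rng.gauss(0.0, 1.0) for value in row]
        for row in saliency
    ]


# Toy 3x3 "ground-truth" attention map with a single central fixation (hypothetical data).
saliency = [
    [0.0, 0.1, 0.0],
    [0.1, 1.0, 0.1],
    [0.0, 0.1, 0.0],
]

alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)

slightly_noisy = noise_saliency_map(saliency, 10, alpha_bars, rng)   # map still recognizable
mostly_noise = noise_saliency_map(saliency, 999, alpha_bars, rng)    # close to pure Gaussian noise
```

In a framework of this kind, a denoising network would learn to reverse these steps, recovering the attention map from noise while conditioned on scene features; how DiffAttn does that conditioning is described in the paper, not in this summary.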
visual attention, intelligent vehicles