PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
PerceptionDLM is a multimodal diffusion language model for efficient parallel region perception, proposed to overcome the limitations of existing autoregressive models on multi-region captioning tasks. Built on PerceptionDLM-Base, it uses efficient prompting and structured attention masking to perceive multiple masked regions simultaneously, which significantly improves inference speed. The work also introduces the Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench), which evaluates both caption quality and efficiency. On this benchmark, PerceptionDLM shows competitive performance on multi-region tasks, which suggests it is useful for practitioners building visual perception applications.
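To make the idea of structured attention masking concrete, here is a minimal sketch in plain Python. It is not the PerceptionDLM implementation: the token layout (shared image and prompt tokens followed by one block of tokens per region), the function name `build_region_mask`, and the rule that regions cannot attend to each other are all assumptions chosen for illustration. The sketch shows one plausible way to let several region captions be decoded in a single pass without interfering with one another.

```python
def build_region_mask(num_shared, region_lengths):
    """Build a boolean attention mask for parallel region captioning.

    Assumed (hypothetical) token layout:
        [shared tokens][region 0 tokens][region 1 tokens]...

    allowed[i][j] is True when token i may attend to token j.
    """
    # Segment id per token: -1 marks shared context, k marks region k.
    segment = [-1] * num_shared
    for k, length in enumerate(region_lengths):
        segment.extend([k] * length)

    total = len(segment)
    allowed = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if segment[j] == -1:
                # Every token may read the shared image/prompt context.
                allowed[i][j] = True
            elif segment[i] == segment[j]:
                # Tokens of one region attend to each other in both
                # directions, but never to tokens of another region.
                allowed[i][j] = True
    return allowed


# Two shared tokens, then a 2-token region and a 3-token region.
mask = build_region_mask(2, [2, 3])
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Under these assumptions each region block sees the shared context and itself only, so the captions are independent of one another and can be denoised together in one forward pass. In a real model the same boolean pattern would typically be passed to the attention layers as a mask tensor.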