Research
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
The article introduces the Attention-Discounted Adaptive Sampler (ADAS), a training-free reranking method for parallel decoding in masked diffusion language models. ADAS modifies subset construction by applying a soft, attention-based marginal penalty. When integrated with existing samplers such as Top-k, Fast-dLLM, and EB-Sampler, it improves performance in the low-NFE (number of function evaluations) regime by an average of 9.11 and 10.46 percentage points, at a runtime overhead of 3.1%. For practitioners, ADAS is a modular way to improve inference quality in masked diffusion models without altering the base sampler's stopping rules.
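To make the idea of an attention-based marginal penalty concrete, here is a minimal sketch of one way such a reranking step could look. It is not taken from the article: the function name `attention_discounted_select`, the symmetric coupling term, and the linear `penalty` weight are all illustrative assumptions. The subset size `k` is assumed to come from the base sampler, so its stopping rule is left untouched.

```python
from typing import List, Sequence


def attention_discounted_select(
    confidence: Sequence[float],
    attention: Sequence[Sequence[float]],
    k: int,
    penalty: float = 1.0,
) -> List[int]:
    """Greedily pick up to k masked positions to decode in parallel.

    Hypothetical sketch: each candidate's confidence is discounted by its
    attention coupling to positions already chosen, so strongly
    interdependent positions are less likely to be decoded together.
    """
    selected: List[int] = []
    remaining = set(range(len(confidence)))

    while remaining and len(selected) < k:
        best_index = -1
        best_score = float("-inf")
        # Iterate in index order so ties resolve to the lowest index.
        for i in sorted(remaining):
            # Assumed penalty form: attention in both directions between
            # the candidate and every position selected so far.
            coupling = sum(attention[i][j] + attention[j][i] for j in selected)
            score = confidence[i] - penalty * coupling
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)

    return selected


# Positions 0 and 1 attend strongly to each other; position 2 is independent.
confidence = [0.9, 0.85, 0.6]
attention = [
    [0.0, 0.4, 0.0],
    [0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
plain_top_k = attention_discounted_select(confidence, attention, k=2, penalty=0.0)
discounted = attention_discounted_select(confidence, attention, k=2, penalty=1.0)
```

With `penalty=0.0` the sketch reduces to plain confidence-based top-k selection. With a positive penalty, the second pick shifts away from the position that is tightly coupled to the first, which is the qualitative behavior the summary describes.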
masked diffusion, language models, sampling