Training
Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models
The paper introduces Prefilling-dLLM, a framework for long-context inference in diffusion language models (dLLMs). It partitions the input prefix into N chunks and caches their key-value (KV) representations, so that computational complexity is quadratic only in the decode length rather than in the full sequence length. The authors report state-of-the-art performance on benchmarks such as LongBench and InfiniteBench, with speedups of 9.1–28.0x for 8K–32K contexts. For practitioners, this makes long contexts in dLLMs cheaper to serve, improving both speed and resource utilization.
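To make the chunk-and-cache idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the even split of the prefix, and the `encode_chunk` callable (a stand-in for whatever forward pass produces a chunk's KV tensors) are all assumptions made for illustration. `quadratic_term` only counts the quadratic term named in the summary and is not a prediction of wall-clock speedup.

```python
from typing import Callable, List, Sequence, Tuple


def partition_prefix(prefix_len: int, num_chunks: int) -> List[Tuple[int, int]]:
    """Split [0, prefix_len) into num_chunks contiguous (start, end) spans.

    Assumption: chunks are as even as possible; the paper may split differently.
    """
    if num_chunks <= 0 or prefix_len < num_chunks:
        raise ValueError("need 1 <= num_chunks <= prefix_len")
    base, extra = divmod(prefix_len, num_chunks)
    spans = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one leftover token.
        end = start + base + (1 if i < extra else 0)
        spans.append((start, end))
        start = end
    return spans


def build_prefix_cache(
    tokens: Sequence[int],
    num_chunks: int,
    encode_chunk: Callable[[Sequence[int]], object],
) -> list:
    """Encode each prefix chunk once and keep the results for reuse.

    `encode_chunk` is hypothetical: it stands in for the model call that
    would return the chunk's key-value representations.
    """
    return [
        encode_chunk(tokens[start:end])
        for start, end in partition_prefix(len(tokens), num_chunks)
    ]


def quadratic_term(length: int) -> int:
    """Toy count of the quadratic attention term for a given length."""
    return length * length


if __name__ == "__main__":
    prefix = list(range(10))
    # Toy encoder: just freezes the chunk so the cache contents are visible.
    cache = build_prefix_cache(prefix, 3, lambda chunk: tuple(chunk))
    print(cache)

    prefix_len, decode_len = 8192, 128  # illustrative sizes, not from the paper
    print(quadratic_term(prefix_len + decode_len), quadratic_term(decode_len))
```

Once the cache is built, the prefix does not need to be re-encoded on later decoding steps, which is where the savings described above would come from under these assumptions.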
diffusion, llm, inference