Inference
S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
The article introduces S2D2, a training-free self-speculative decoding framework for block-diffusion language models that speeds up decoding without additional training or significant extra test-time compute. A block-diffusion model decodes autoregressively when its block size is reduced to one, so the same pretrained model can serve as both drafter and verifier; the resulting hybrid decoding method improves the accuracy-speed tradeoff. In the reported benchmarks, S2D2 achieves up to 4.7× speedup over autoregressive decoding and up to 1.57× over dynamic baselines while improving accuracy by up to 4.5 points, which makes it relevant to practitioners seeking efficient LLM generation.
diffusion-models, decoding, self-speculation
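
To make the drafter/verifier idea in the summary concrete, here is a minimal sketch of a generic draft-and-verify loop. It is not the S2D2 implementation. It assumes greedy prefix-match acceptance (a simplification swapped in for whatever acceptance rule the article uses), and the names `accept_prefix`, `self_speculative_decode`, `toy_draft`, and `toy_verify` are hypothetical. In a real system both callbacks would wrap the same pretrained model, drafting with a full block and verifying with block size one; here they are toy functions over integer tokens.

```python
from typing import Callable, List, Sequence

Token = int


def accept_prefix(draft: Sequence[Token], verified: Sequence[Token]) -> List[Token]:
    """Keep draft tokens until the first disagreement, then take the verifier's token."""
    out: List[Token] = []
    for d, v in zip(draft, verified):
        out.append(v)  # when d == v this is the draft token itself
        if d != v:
            break      # everything after a mismatch is discarded
    return out


def self_speculative_decode(
    prefix: Sequence[Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    verify_block: Callable[[List[Token], List[Token]], List[Token]],
    max_new_tokens: int,
    block_size: int,
) -> List[Token]:
    """Alternate between drafting a block and verifying it (illustrative only)."""
    tokens = list(prefix)
    target = len(tokens) + max_new_tokens
    while len(tokens) < target:
        k = min(block_size, target - len(tokens))
        draft = draft_block(tokens, k)         # assumed: one parallel block draft
        verified = verify_block(tokens, draft)  # assumed: one verifier pass over the draft
        tokens.extend(accept_prefix(draft, verified))  # always adds at least one token
    return tokens


# Toy stand-ins (hypothetical): the "true" next token is (previous + 1) mod 10.
def toy_verify(context: List[Token], draft: List[Token]) -> List[Token]:
    ctx, out = list(context), []
    for d in draft:
        out.append((ctx[-1] + 1) % 10)  # verifier's token given context + earlier draft tokens
        ctx.append(d)
    return out


def toy_draft(context: List[Token], k: int) -> List[Token]:
    out, last = [], context[-1]
    for i in range(k):
        nxt = (last + 1) % 10 if i < 2 else 0  # deliberately wrong from the third position on
        out.append(nxt)
        last = nxt
    return out


result = self_speculative_decode([0], toy_draft, toy_verify, max_new_tokens=6, block_size=4)
print(result)
```

In this toy setup the output matches what the verifier alone would produce token by token, but it is reached in fewer draft/verify rounds because correct draft tokens are accepted in bulk. That is the general source of speedup in speculative schemes; the specific figures quoted above come from the article, not from this sketch.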