Inference
Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs
The article introduces Spiffy, a speculative decoding algorithm that speeds up inference for diffusion LLMs (dLLMs) while preserving the model's output distribution. It organizes drafts in a novel directed draft graph that supports auto-speculation and dynamic pruning. Reported gains reach up to an 8.6× reduction in model inferences and a 6.3× increase in token generation rate on models such as LLaDA, Dream, and SDAR. The result is relevant to practitioners who need lower-latency dLLM inference in real-time applications.
diffusion-llm, speculative-decoding, inference