Research
JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
JetFlow is a head-based speculative decoding framework for accelerating autoregressive large language models (LLMs). It uses branch-wise causal conditioning over fused hidden states from a frozen target model, so that larger draft budgets translate into longer accepted prefixes and faster end-to-end decoding. Reported speedups reach up to 9.64x on MATH-500 and 4.58x on conversational tasks compared to existing methods. Integration with vLLM further improves latency under realistic serving conditions, which makes the framework relevant to practitioners who want to accelerate LLM inference.
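To make the link between draft budget and accepted prefix length concrete, here is a minimal Python sketch of greedy verification over a draft tree. It is not JetFlow's implementation: the names `DraftTree`, `verify_tree`, and `toy_target` are hypothetical, the target model is replaced by a toy function, and the per-step calls stand in for what a real system would likely do in a single batched pass over the whole tree.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: token id -> subtree of drafted continuations.
DraftTree = Dict[int, "DraftTree"]


def verify_tree(
    prefix: Sequence[int],
    tree: DraftTree,
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted in one speculative step under greedy verification.

    The walk follows the branch that matches the target's own greedy choice.
    It stops at the first mismatch or at a leaf, keeping the target's token
    there as well, so every step emits at least one token.
    """
    accepted: List[int] = []
    node = tree
    while True:
        tok = target_next(list(prefix) + accepted)
        accepted.append(tok)      # the target's token is always kept
        if tok not in node:       # no drafted branch matches: stop here
            break
        node = node[tok]          # drafted branch accepted: go one level deeper
    return accepted


def toy_target(tokens: Sequence[int]) -> int:
    """Stand-in for the target model (assumption): next token is last + 1 mod 10."""
    return (tokens[-1] + 1) % 10


# A larger draft tree that happens to cover the target's path.
wide_tree: DraftTree = {2: {3: {4: {}}, 7: {}}, 5: {}}
# A small draft tree whose only branch is wrong.
narrow_tree: DraftTree = {5: {}}

print(verify_tree([1], wide_tree, toy_target))
print(verify_tree([1], narrow_tree, toy_target))
```

In this toy setting the wide tree yields four tokens in one step and the narrow tree yields only the single target token. A larger budget helps only when the extra branches cover the path the target would actually take, which is the property the summary above attributes to JetFlow's drafting.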
speculative decoding, llm, scaling