Inference
StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
StreamKL introduces a GPU primitive for computing the Kullback-Leibler (KL) divergence in attention distillation that substantially reduces the memory and I/O costs of existing methods. Using an online formulation that streams query-key tiles through on-chip SRAM, StreamKL achieves up to a 43x speedup in the forward pass and 14x in the backward pass, while reducing the memory footprint from O(N_Q N_K) to O(1). This enables long-context attention distillation on a single GPU, which is useful for practitioners working on knowledge distillation and model compression.
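To make the idea of an online formulation concrete, the following NumPy sketch shows one way a per-query KL divergence between teacher and student attention distributions could be accumulated tile by tile, without ever materializing the full N_Q x N_K attention matrices. It is an illustration, not the StreamKL kernel itself. The function names (`streaming_kl`, `dense_kl`), the KL direction (teacher to student), the 1/sqrt(d) score scaling, and the tile size are all assumptions made for this example. It relies on the identity KL(P || Q) = E_P[t - s] - logsumexp(t) + logsumexp(s), where t and s are the teacher and student attention scores, so only running maxima and running sums need to be kept per query.

```python
import numpy as np

def streaming_kl(q_t, k_t, q_s, k_s, tile=128):
    """KL(teacher || student) per query row, streamed over key tiles.

    Hypothetical reference sketch: keeps only per-query running
    statistics instead of the full (N_Q, N_K) attention matrices.
    """
    n_q, n_k = q_t.shape[0], k_t.shape[0]
    scale_t = 1.0 / np.sqrt(q_t.shape[1])
    scale_s = 1.0 / np.sqrt(q_s.shape[1])

    m_t = np.full(n_q, -np.inf)  # running max of teacher scores
    l_t = np.zeros(n_q)          # running sum of exp(t - m_t)
    w = np.zeros(n_q)            # running sum of exp(t - m_t) * (t - s)
    m_s = np.full(n_q, -np.inf)  # running max of student scores
    l_s = np.zeros(n_q)          # running sum of exp(s - m_s)

    for start in range(0, n_k, tile):
        stop = start + tile
        t = (q_t @ k_t[start:stop].T) * scale_t  # (n_q, <=tile) scores
        s = (q_s @ k_s[start:stop].T) * scale_s

        # Teacher: rescale old accumulators to the new running max.
        new_m_t = np.maximum(m_t, t.max(axis=1))
        alpha = np.exp(m_t - new_m_t)
        p = np.exp(t - new_m_t[:, None])
        l_t = l_t * alpha + p.sum(axis=1)
        w = w * alpha + (p * (t - s)).sum(axis=1)
        m_t = new_m_t

        # Student: only the log-normalizer is needed.
        new_m_s = np.maximum(m_s, s.max(axis=1))
        l_s = l_s * np.exp(m_s - new_m_s) + np.exp(s - new_m_s[:, None]).sum(axis=1)
        m_s = new_m_s

    log_z_t = m_t + np.log(l_t)
    log_z_s = m_s + np.log(l_s)
    return w / l_t - log_z_t + log_z_s


def dense_kl(q_t, k_t, q_s, k_s):
    """Naive baseline that materializes both full attention matrices."""
    def log_softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    log_p = log_softmax(q_t @ k_t.T / np.sqrt(q_t.shape[1]))
    log_q = log_softmax(q_s @ k_s.T / np.sqrt(q_s.shape[1]))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
```

In this sketch the state carried across tiles is five scalars per query, independent of N_K. A GPU kernel would presumably also tile over queries and keep that state in registers or SRAM, but how StreamKL organizes this is not specified in the summary above.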
attention distillation, kl divergence, optimization