Inference
SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
The article introduces SpenseGPT, a one-shot post-training pruning method that uses a hybrid sparse-dense format to make semi-structured 2:4 sparsity in weight matrices efficient to exploit. It reports up to 1.2x end-to-end decoding speedup on Qwen3-32B and Seed-OSS-36B on B200 GPUs at FP8 precision while maintaining accuracy. For practitioners, this is a practical way to speed up LLM inference without requiring specialized compiler support or sacrificing model performance.
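For readers unfamiliar with the 2:4 pattern mentioned above: it allows at most two nonzero values in every group of four consecutive weights. The sketch below illustrates only that constraint, using plain magnitude-based selection as a stand-in. It is not SpenseGPT's pruning criterion or its hybrid sparse-dense format, neither of which this summary describes, and the function name `prune_2_4` is hypothetical.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in each group of four.

    Illustrative magnitude-based selection only (an assumption for this
    sketch), not the criterion used by SpenseGPT.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Each group of four keeps exactly two values, which is the regularity that lets hardware with 2:4 sparse support skip the zeroed entries.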
llm, pruning, inference