ai-digest.dev
Inference · arXiv cs.CL · 2 d ago

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

The paper introduces SpenseGPT, a one-shot post-training pruning method built around a hybrid sparse-dense weight format that makes semi-structured 2:4 sparsity (at most two nonzero values in every group of four consecutive weights) efficient to use at inference time. It reports up to 1.2x end-to-end decoding speedup on Qwen3-32B and Seed-OSS-36B on B200 GPUs with FP8 precision while maintaining accuracy. For practitioners, this is a practical way to speed up LLM inference without specialized compiler support and without sacrificing model quality.
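
To make the 2:4 pattern concrete, here is a minimal sketch in plain Python. It keeps the two largest-magnitude weights in each group of four and zeroes the rest. The magnitude criterion and the names `prune_2_4` and `is_2_4_sparse` are assumptions chosen for illustration; the summary does not describe how SpenseGPT selects which weights to drop, nor how its hybrid sparse-dense format is laid out.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in every group of four.

    Illustrative magnitude-based rule only; not SpenseGPT's selection criterion.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_4_sparse(row):
    """Return True if every group of four has at most two nonzero entries."""
    return all(
        sum(1 for v in row[s:s + 4] if v != 0.0) <= 2
        for s in range(0, len(row), 4)
    )


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_4(weights)
print(sparse)                  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
print(is_2_4_sparse(sparse))   # True
print(is_2_4_sparse(weights))  # False
```

Because the position of the zeros is constrained within each group, hardware with 2:4 support can store only the two surviving values plus small position indices, which is what makes this pattern cheaper to execute than unstructured sparsity.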

llm · pruning · inference