Inference · arXiv cs.AI · 4 d ago

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

TileFuse is a mixed-precision kernel library for quantized LLM inference on AMD XDNA2 NPUs, targeting transformer linear layers. It introduces an interleaved pre-tiling layout and fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel, reporting performance improvements of up to 121.6% for GEMM and 281% for GEMV over full-precision baselines. In end-to-end LLM tasks, the library also reduces latency and improves energy efficiency. The results indicate that off-the-shelf quantization formats such as AWQ are viable on edge devices, which matters for practitioners deploying LLMs on client NPUs.
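To make the fusion idea concrete, here is a minimal plain-Python sketch of a GEMV that unpacks 4-bit weights, dequantizes them, and accumulates in one pass, without ever materializing a full-precision weight matrix. It is not TileFuse's kernel or API: the function names, the low-nibble-first packing order, the tiny group size, and the `(q - zero) * scale` group-wise dequantization are all assumptions chosen for illustration, and the summary's interleaved pre-tiling layout is not modeled.

```python
# Illustrative only: a scalar sketch of fused unpack + dequantize + GEMV.
# Packing order, group size, and all names are assumptions, not TileFuse's.

GROUP_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow


def pack_int4(values):
    """Pack a flat list of 4-bit ints (0..15) two per byte, low nibble first."""
    assert len(values) % 2 == 0
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))


def fused_gemv(packed_rows, scales, zeros, x):
    """Compute y = W @ x where each row of W is stored as packed 4-bit codes.

    Every weight is dequantized on the fly as (q - zero) * scale, using the
    scale and zero-point of its column group, and consumed immediately.
    """
    y = []
    for row, row_scales, row_zeros in zip(packed_rows, scales, zeros):
        acc = 0.0
        for byte_idx, byte in enumerate(row):
            # Unpack the two 4-bit codes held in this byte.
            for nibble, q in enumerate((byte & 0xF, byte >> 4)):
                col = 2 * byte_idx + nibble
                g = col // GROUP_SIZE
                acc += (q - row_zeros[g]) * row_scales[g] * x[col]
        y.append(acc)
    return y


def reference_gemv(q_rows, scales, zeros, x):
    """Unfused baseline: materialize dequantized weights, then multiply."""
    y = []
    for q_row, row_scales, row_zeros in zip(q_rows, scales, zeros):
        w = [(q - row_zeros[c // GROUP_SIZE]) * row_scales[c // GROUP_SIZE]
             for c, q in enumerate(q_row)]
        y.append(sum(wi * xi for wi, xi in zip(w, x)))
    return y


# Made-up 2x8 weight codes with two column groups per row.
q_rows = [[1, 15, 8, 0, 7, 7, 3, 12],
          [9, 2, 4, 4, 0, 15, 6, 1]]
scales = [[0.5, 0.25], [0.125, 1.0]]
zeros = [[8, 8], [4, 7]]
x = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0, 1.5]

packed_rows = [pack_int4(r) for r in q_rows]
y_fused = fused_gemv(packed_rows, scales, zeros, x)
y_ref = reference_gemv(q_rows, scales, zeros, x)
```

Both paths produce the same output vector. The difference is that the fused path reads only the packed bytes and the per-group parameters, which is the memory-traffic saving that a fused kernel aims for on bandwidth-limited hardware.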

quantization · llm · amd · inference
