Inference
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
The article introduces ReSET, a reasoning-step entropy-based temperature-scaling method that mitigates the loss of reasoning accuracy caused by NVFP4 quantization in large reasoning models (LRMs). The method adapts the decoding temperature using both token-level and step-level entropy signals, improving accuracy by up to 2 points over the NVFP4 baseline. The article also presents a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding, which offers up to 2.5× speedup at the kernel level and approximately 2× end-to-end speedup compared to BF16. Together, these results are relevant to practitioners working on efficient LLM inference.
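To make the idea of entropy-driven temperature adaptation concrete, here is a minimal Python sketch. It is not the ReSET algorithm: the summary does not give the actual update rule, so the function `adapt_temperature`, the class `StepEntropyTracker`, the weights `alpha` and `beta`, the floor `min_temp`, and the choice to lower the temperature as entropy rises are all assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class StepEntropyTracker:
    """Running mean of token entropies within one reasoning step (hypothetical helper)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, token_entropy):
        self.total += token_entropy
        self.count += 1

    def mean(self):
        return self.total / self.count if self.count else 0.0

    def reset(self):
        # Call this at a reasoning-step boundary, however the decoder detects one.
        self.total = 0.0
        self.count = 0


def adapt_temperature(token_entropy, step_entropy,
                      base_temp=1.0, alpha=0.5, beta=0.5, min_temp=0.3):
    """Assumed rule: lower the temperature as the combined entropy signal grows."""
    signal = alpha * token_entropy + beta * step_entropy
    return max(min_temp, base_temp / (1.0 + signal))


# Usage: pick a temperature for the next token from both entropy signals.
tracker = StepEntropyTracker()
logits = [2.0, 1.0, 0.0]
token_h = entropy(softmax(logits))
tracker.update(token_h)
temperature = adapt_temperature(token_h, tracker.mean())
next_token_probs = softmax(logits, temperature)
```

In this sketch the token-level signal reacts to the current distribution, while the step-level signal smooths over the tokens generated so far in the current reasoning step. How ReSET actually combines the two signals may differ.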
reasoning, latency, temperature scaling