Inference
Top-N-Sigma: Remove unconditional softmax+sort by TimNN · Pull Request #22645 · ggml-org/llama.cpp
Pull Request #22645 modifies the Top-N-Sigma sampler in the ggml-org/llama.cpp repository, removing the softmax and sort operations that previously ran unconditionally at the end of the sampler. The reported effect is a throughput increase from approximately 30 tokens per second (t/s) to 45 t/s on a MacBook Pro M3 Max, which works out to roughly 10 milliseconds saved per token (about 33 ms down to about 22 ms). For practitioners, the change makes sampling cheaper, particularly when Top-N-Sigma is used in conjunction with other samplers, and so can make model inference more efficient.
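To make the idea concrete, here is a minimal Python sketch of a Top-N-Sigma filter that only masks logits and performs no softmax or sort. It is not the llama.cpp code: the function name `top_n_sigma_filter`, the use of the population standard deviation, and the handling of already-masked (`-inf`) entries are all assumptions made for illustration. The rule it applies is the commonly described one: keep tokens whose logit is within `n` standard deviations of the maximum logit.

```python
import math


def top_n_sigma_filter(logits, n=1.0):
    """Mask logits below (max - n * sigma) with -inf.

    Hypothetical helper for illustration only. It makes two linear passes
    over the logits and never normalizes or reorders them.
    """
    # Ignore entries that an earlier sampler already masked out (assumption).
    finite = [x for x in logits if x != -math.inf]
    if n <= 0 or not finite:
        return list(logits)

    max_logit = max(finite)
    mean = sum(finite) / len(finite)
    # Population standard deviation (an assumption for this sketch).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in finite) / len(finite))

    threshold = max_logit - n * sigma
    # Token order is preserved; only the values below the threshold change.
    return [x if x >= threshold else -math.inf for x in logits]


if __name__ == "__main__":
    print(top_n_sigma_filter([2.0, 1.0, 0.0, -1.0, -2.0], n=1.0))
```

Because the filter leaves the surviving logits unnormalized and in their original order, any softmax or sort can be deferred to a later step that actually needs it, rather than being paid for inside this sampler on every token.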
Top-N-Sigma · sampling · optimization