Open Source · Reddit r/LocalLLaMA · 14 d ago

I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

A fork of ik_llama.cpp adds a `--numa mirror` mode intended to improve inference performance on multi-socket CPU systems. The mode duplicates the model weights and KV cache on each CPU socket, so every socket reads from its own local memory instead of paying the penalty of remote (cross-socket) memory accesses, which lets all CPU cores be used effectively. The trade-off is memory: a dual-socket system needs twice the RAM. The author reports significant performance gains in initial benchmarks and is looking for testers.
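The memory trade-off described above can be estimated with a few lines of arithmetic. This is a minimal sketch, not code from the fork: the function name `mirrored_ram_gib` and the example sizes are hypothetical, and extending the "double the RAM on dual-socket" rule linearly to more than two sockets is an assumption.

```python
def mirrored_ram_gib(weights_gib: float, kv_cache_gib: float, sockets: int = 2) -> float:
    """Estimate total RAM when every socket holds a full local copy.

    Assumption: each socket stores one complete copy of the weights and
    the KV cache, so the footprint grows linearly with the socket count.
    """
    if sockets < 1:
        raise ValueError("sockets must be >= 1")
    per_socket = weights_gib + kv_cache_gib
    return per_socket * sockets


# Hypothetical sizes chosen for illustration; they are not figures from the post.
single = mirrored_ram_gib(40.0, 8.0, sockets=1)
dual = mirrored_ram_gib(40.0, 8.0, sockets=2)
print(f"single copy: {single:.1f} GiB, mirrored across 2 sockets: {dual:.1f} GiB")
```

A check like this is useful before testing the mode, since a model that fits comfortably in one copy may no longer fit once it is mirrored on every socket.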

