Inference · Reddit r/LocalLLaMA · 15 d ago

$1,800 in GPU cost: running Qwen/Qwen3.6-27b-FP8 with P2P, 262K context, and BF16 KV cache at 55 tok/s

The post describes an inference-only setup that runs the Qwen3.6-27B model (FP8 weights) on four NVIDIA 5060 Ti GPUs with P2P enabled, for a total GPU cost of approximately $1,800. The configuration uses a BF16 KV cache, supports a maximum context length of 262,144 tokens, and reaches a benchmarked output throughput of 55.67 tokens per second. For practitioners, it shows that long-context inference on a 27B-parameter model is achievable on consumer GPUs at modest cost, provided the GPUs and their configuration are used efficiently.
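To put the reported figures in perspective, the short sketch below works through the arithmetic. The constants come from the summary above; the function names and the 1,000-token example are illustrative assumptions, and the timing estimate assumes the benchmarked output throughput holds as a steady rate, which the post may not guarantee.

```python
# Figures reported in the summary above.
GPU_COUNT = 4
TOTAL_GPU_COST_USD = 1800
THROUGHPUT_TOK_PER_S = 55.67
MAX_CONTEXT_TOKENS = 262_144


def cost_per_gpu(total_cost_usd: float, gpu_count: int) -> float:
    """Average cost of one GPU in the build (hypothetical helper)."""
    return total_cost_usd / gpu_count


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Time to emit num_tokens, assuming a constant output rate."""
    return num_tokens / tok_per_s


if __name__ == "__main__":
    # 262,144 is exactly 2**18, i.e. 256 * 1024 tokens ("262K").
    assert MAX_CONTEXT_TOKENS == 2**18 == 256 * 1024

    per_gpu = cost_per_gpu(TOTAL_GPU_COST_USD, GPU_COUNT)
    print(f"Cost per GPU: ${per_gpu:.0f}")

    # Illustrative response length; not a figure from the post.
    seconds = generation_seconds(1000, THROUGHPUT_TOK_PER_S)
    print(f"Time for a 1,000-token response: {seconds:.1f} s")
```

At the reported rate, the build averages $450 per GPU and a 1,000-token response takes roughly 18 seconds.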

qwen · context · performance
