Inference
UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators
UltraSketchLLM compresses large language models (LLMs) to a peak memory footprint of 0.5 bits per weight using data sketch techniques. It achieves a 14.9x speedup over traditional sketch solutions while maintaining acceptable performance, which makes it suitable for deployment in resource-constrained environments. Its hardware-friendly operators are particularly relevant for practitioners who want to optimize LLMs for efficiency while keeping performance acceptable.
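To make the idea of sketch-based compression concrete, here is a minimal illustration of how hashing many weights into a small shared table can push storage below one bit per weight. It is a generic count-sketch-style toy, not UltraSketchLLM's actual algorithm, which this summary does not describe. The function names (`bucket_and_sign`, `compress`, `lookup`, `bits_per_weight`), the seeded hash, and the choice to store a per-bucket mean are all assumptions made for illustration.

```python
import random


def bucket_and_sign(index, num_buckets, seed=0):
    """Hypothetical hash: map a weight index to a (bucket, sign) pair."""
    rng = random.Random(index * 1_000_003 + seed)
    return rng.randrange(num_buckets), rng.choice((-1.0, 1.0))


def compress(weights, num_buckets, seed=0):
    """Fold all weights into a table with num_buckets shared values.

    Each bucket stores the mean of sign * weight over the weights hashed
    to it (an assumption; a classic count sketch stores the signed sum).
    """
    sums = [0.0] * num_buckets
    counts = [0] * num_buckets
    for i, w in enumerate(weights):
        bucket, sign = bucket_and_sign(i, num_buckets, seed)
        sums[bucket] += sign * w
        counts[bucket] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def lookup(table, index, seed=0):
    """Estimate one weight from the table; nothing per-weight is stored."""
    bucket, sign = bucket_and_sign(index, len(table), seed)
    return sign * table[bucket]


def bits_per_weight(num_weights, num_buckets, bits_per_bucket):
    """Average storage cost per original weight, ignoring the hash seed."""
    return num_buckets * bits_per_bucket / num_weights


if __name__ == "__main__":
    weights = [random.Random(1).uniform(-1, 1) for _ in range(32)]
    table = compress(weights, num_buckets=4)
    print(bits_per_weight(len(weights), len(table), bits_per_bucket=4))
    print(lookup(table, 0))
```

In this toy setup, 32 weights share 4 buckets of 4 bits each, so the average cost is 0.5 bits per weight. The price is collision error: weights that land in the same bucket are reconstructed from one shared value, so accuracy depends on how the table size and hashing are chosen.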
compression, llm