LLM Compression by Block Removal with Constrained Binary Optimization
The paper presents an approach for compressing large language models (LLMs) by selecting which transformer blocks to remove, formulated as a constrained binary optimization (CBO) problem that the authors liken to an Ising spin-glass system. It reports a gain of 23 percentage points on the MMLU benchmark at 50% compression of Llama-3.3-70B-Instruct, and remains competitive at lighter compression levels on models such as Llama-3.1-8B-Instruct and Qwen3-14B. The approach is computationally efficient, applies to any architecture, and remains effective even when the optimization is solved with heuristic solvers, which makes it practical for reducing resource requirements while preserving model performance.
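To make the idea of block removal as constrained binary optimization concrete, the sketch below sets up a toy problem of that shape. It is an illustration under stated assumptions, not the paper's implementation. Each block i gets a binary variable x_i (1 means "remove"), the objective is an Ising-like energy with linear terms h_i and pairwise terms J_ij, and the constraint fixes the number of removed blocks to k. The summary does not say how the coefficients are obtained, so `random_instance` fills them with random placeholders. The names `h`, `J`, `energy`, `brute_force`, and `anneal` are hypothetical, and simulated annealing stands in for "a heuristic solver" in general.

```python
import itertools
import math
import random


def random_instance(n, seed=0):
    """Placeholder coefficients (assumption: NOT derived from any real model)."""
    rng = random.Random(seed)
    h = [rng.uniform(0.0, 1.0) for _ in range(n)]
    J = [[rng.uniform(-0.5, 0.5) if j > i else 0.0 for j in range(n)]
         for i in range(n)]
    return h, J


def energy(x, h, J):
    """Ising-like energy of a removal mask x (x[i] == 1 means block i is removed)."""
    n = len(x)
    linear = sum(h[i] * x[i] for i in range(n))
    pairwise = sum(J[i][j] * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
    return linear + pairwise


def brute_force(h, J, k):
    """Exact minimum over all masks removing exactly k blocks (small n only)."""
    n = len(h)
    best_x, best_e = None, math.inf
    for removed in itertools.combinations(range(n), k):
        x = [0] * n
        for i in removed:
            x[i] = 1
        e = energy(x, h, J)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e


def anneal(h, J, k, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Heuristic solver: simulated annealing with constraint-preserving swaps.

    Every move swaps one removed block with one kept block, so
    sum(x) == k holds throughout the search.
    """
    rng = random.Random(seed)
    n = len(h)
    x = [1] * k + [0] * (n - k)
    rng.shuffle(x)
    e = energy(x, h, J)
    if k == 0 or k == n:
        return x, e  # only one feasible mask, nothing to search
    best_x, best_e = x[:], e
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.choice([p for p in range(n) if x[p] == 1])
        j = rng.choice([p for p in range(n) if x[p] == 0])
        x[i], x[j] = 0, 1
        e_new = energy(x, h, J)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new
            if e < best_e:
                best_x, best_e = x[:], e
        else:
            x[i], x[j] = 1, 0  # rejected: undo the swap
    return best_x, best_e
```

For a small instance, for example `h, J = random_instance(10)` with `k = 5`, the result of `anneal(h, J, 5)` can be checked against `brute_force(h, J, 5)`. The heuristic energy can never fall below the exact minimum, and the gap shows how much the heuristic gives up. Exhaustive search becomes infeasible at realistic block counts because the number of candidate masks grows combinatorially, which is why a method that works well with heuristic solvers is useful.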