Training
Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices
The paper presents a set of techniques for reducing peak memory usage during Low-Rank Adaptation (LoRA) fine-tuning of Large Language Models (LLMs) on edge devices. Key innovations include base model quantization with on-the-fly dequantization, memory-efficient checkpointing, softmax approximation with token subsets, and logits masking. Together, these yield reductions in peak memory of up to 26x for Llama-3.2 3B and up to 28x for Qwen-2.5 3B. The results are relevant to practitioners who want to fine-tune and deploy personalized LLMs on consumer hardware while maintaining model performance.
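To make the first technique concrete, here is a minimal sketch of a LoRA forward pass over a quantized base weight with on-the-fly dequantization. It is an illustration under stated assumptions, not the paper's implementation: the symmetric per-row int8 scheme, the row-at-a-time granularity, and the names `quantize_rows` and `lora_forward` are all hypothetical choices made for this example. The summary does not specify the paper's quantization format or kernel.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def quantize_rows(weight: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Avoid division by zero for an all-zero row.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def lora_forward(
    q_rows: List[List[int]],
    scales: List[float],
    lora_a: Matrix,  # shape (r, d_in), trainable, full precision
    lora_b: Matrix,  # shape (d_out, r), trainable, full precision
    x: Vector,       # shape (d_in,)
) -> Vector:
    """Compute y = W x + B (A x), dequantizing one row of W at a time."""
    # Low-rank path: project the input down to rank r once.
    ax = [sum(a * xi for a, xi in zip(a_row, x)) for a_row in lora_a]
    y: Vector = []
    for q_row, scale, b_row in zip(q_rows, scales, lora_b):
        # Only this single row is materialized in float form; it is
        # discarded after the dot product, so the full-precision W
        # never exists in memory all at once.
        w_row = [q * scale for q in q_row]
        base = sum(w * xi for w, xi in zip(w_row, x))
        delta = sum(b * a for b, a in zip(b_row, ax))
        y.append(base + delta)
    return y
```

The base weight stays in its compact integer form throughout and receives no gradient in LoRA, so only the small `lora_a` and `lora_b` matrices need full-precision storage. A real implementation would dequantize in tiles inside a fused kernel rather than row by row in Python.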
fine-tuning, lora, memory-reduction