Training · arXiv cs.AI · 15 d ago

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

The paper presents techniques for reducing peak memory usage during Low-Rank Adaptation (LoRA) fine-tuning of large language models (LLMs) on edge devices. The techniques are base model quantization with on-the-fly dequantization, memory-efficient checkpointing, softmax approximation over token subsets, and logits masking. The reported result is up to a 26x reduction in peak memory for Llama-3.2 3B and up to 28x for Qwen-2.5 3B. These reductions matter for practitioners who want to fine-tune and personalize LLMs on consumer hardware without sacrificing performance.
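To make the first technique concrete, here is a minimal sketch of a LoRA forward pass over a quantized base weight with on-the-fly dequantization. It is an illustration under stated assumptions, not the paper's implementation: the symmetric per-row int8 scheme, the function names `quantize_rows` and `lora_forward`, and the `alpha` scaling parameter are all hypothetical choices made for this example. The point it shows is that the frozen base matrix can stay in integer form, with each row converted back to floating point only at the moment it is used, while the small trainable LoRA matrices `A` and `B` stay in full precision.

```python
def quantize_rows(W):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in W:
        max_abs = max(abs(v) for v in row) or 1.0  # avoid a zero scale for all-zero rows
        scale = max_abs / 127.0
        q_rows.append([round(v / scale) for v in row])  # integers in [-127, 127]
        scales.append(scale)
    return q_rows, scales


def lora_forward(x, q_rows, scales, A, B, alpha=1.0):
    """Compute y = W x + alpha * B (A x) without materializing W in float.

    q_rows, scales: quantized frozen base weight, shape (d_out x d_in)
    A: trainable LoRA down-projection, shape (r x d_in)
    B: trainable LoRA up-projection, shape (d_out x r)
    """
    # Low-rank path: project the input down to rank r once.
    ax = [sum(a * xi for a, xi in zip(a_row, x)) for a_row in A]
    y = []
    for q_row, scale, b_row in zip(q_rows, scales, B):
        # Dequantize on the fly: only one row's worth of float values exists at a time.
        base = scale * sum(q * xi for q, xi in zip(q_row, x))
        delta = sum(b * v for b, v in zip(b_row, ax))
        y.append(base + alpha * delta)
    return y


# Usage: a 2x2 base weight with a rank-1 adapter.
W = [[1.0, -2.0], [0.5, 0.25]]
q_rows, scales = quantize_rows(W)
y = lora_forward([1.0, 1.0], q_rows, scales, A=[[1.0, 0.0]], B=[[1.0], [2.0]])
```

The output matches the full-precision result only up to quantization error, which is the usual trade-off when the frozen base weights are stored at reduced precision.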

fine-tuning · lora · memory-reduction
