EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction

1 National Taiwan University, 2 NVIDIA
arXiv Preprint 2025

The dilemma caused by additional memory overhead during fine-tuning. (a) Users opt for a smaller 8B model, sacrificing emergent capabilities and underutilizing the available hardware. (b) Users choose a larger 26B model, whose fine-tuning memory exceeds the hardware limit. (c) Our EMLoC fine-tunes through a smaller emulator, allowing the same memory budget for both training and inference.


Abstract

Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific lightweight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning is then performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which can thus be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU, bringing efficient and practical model adaptation to individual users.
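
Below is a minimal, hypothetical sketch of the activation-aware SVD step described in the abstract, assuming a simple per-channel activation scaling heuristic. The exact emulator construction, the calibration procedure, and in particular the LoRA correction algorithm are specified in the paper; all shapes, names, and the scaling scheme here are illustrative only.

import torch


def activation_aware_svd(W, act_scale, rank):
    # W: (out_features, in_features) weight of a linear layer.
    # act_scale: (in_features,) per-channel importance, e.g. mean |activation|
    #            measured on a small downstream calibration set (assumed heuristic).
    # Scale each input channel before the SVD so channels carrying larger
    # activations are preserved more faithfully, then undo the scaling.
    U, sigma, Vh = torch.linalg.svd(W * act_scale, full_matrices=False)
    B = U[:, :rank] * sigma[:rank]   # (out_features, rank)
    A = Vh[:rank, :] / act_scale     # (rank, in_features)
    return B, A                      # emulator weight ~= B @ A


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(4096, 11008)             # toy weight matrix
    X = torch.randn(256, 11008)              # toy calibration activations
    act_scale = X.abs().mean(dim=0) + 1e-6   # avoid division by zero
    B, A = activation_aware_svd(W, act_scale, rank=512)
    compression = (B.numel() + A.numel()) / W.numel()
    print(f"emulator keeps {compression:.1%} of the original parameters")

LoRA modules are then trained against the low-rank emulator weights; the correction that aligns the learned LoRA with the original full-rank weights before merging is EMLoC's contribution and is not reproduced in this sketch.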

Method

Quantitative Results

Qualitative Results


EMLoC enables personalization of FLUX.1-dev on a single 24GB GPU. DreamBooth with LoRA is used to personalize the 12B FLUX.1-dev diffusion model, illustrating that EMLoC can be effectively extended to generative tasks beyond text.
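
For context, a minimal sketch (inference only, not the EMLoC training pipeline) of how a LoRA adapter produced by such personalization could be applied to FLUX.1-dev with Hugging Face diffusers; the adapter path and prompt are placeholders.

import torch
from diffusers import FluxPipeline

# Load the 12B FLUX.1-dev base model and attach a fine-tuned LoRA adapter.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("path/to/personalized_lora")  # placeholder path
pipe.enable_model_cpu_offload()                      # helps fit a 24GB consumer GPU

image = pipe(
    "a photo of sks dog on the beach",               # placeholder subject prompt
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("personalized.png")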

BibTeX

@article{lin2025emloc,
  title={EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction},
  author={Hsi-Che Lin and Yu-Chu Yu and Kai-Po Chang and Yu-Chiang Frank Wang},
  journal={arXiv preprint arXiv:2506.12015},
  year={2025}
}