LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning
| Main Authors: | Kailash Gogineni, Ali Suvizi, Guru Venkataramani |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Open Journal of the Computer Society |
| Subjects: | Artificial intelligence; large language models; fine-tuning; power efficiency |
| Online Access: | https://ieeexplore.ieee.org/document/11037824/ |
| _version_ | 1849428433695670272 |
|---|---|
| author | Kailash Gogineni; Ali Suvizi; Guru Venkataramani |
| author_facet | Kailash Gogineni; Ali Suvizi; Guru Venkataramani |
| author_sort | Kailash Gogineni |
| collection | DOAJ |
| description | Large Language Models (LLMs) have shown remarkable capabilities across applications including robotics, telecommunications, and scientific discovery. While much attention has been given to LLM inference and training, fine-tuning has received less focus despite its growing cost, especially from a systems perspective. Fine-tuning is particularly important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn requires a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which demand substantial power and memory. To explore the performance optimization opportunities available for improving LLM fine-tuning runtime, we analyze the impact of techniques such as activation checkpointing, low-rank adaptation, and operation fusion. In addition, we explore the effects of constraining resource utilization through GPU peak power capping. Our experiments, conducted on an NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, capping power to 300 W results in an average throughput drop of only 5.55% while reducing power consumption by 33%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that these optimization techniques preserve model quality while reducing resource requirements. Furthermore, we discuss several insights and potential future research directions from a systems perspective. |
| format | Article |
| id | doaj-art-4ce272f407554d32a49644ce998a1bea |
| institution | Kabale University |
| issn | 2644-1268 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Open Journal of the Computer Society |
| spelling | doaj-art-4ce272f407554d32a49644ce998a1bea 2025-08-20T03:28:43Z; eng; IEEE; IEEE Open Journal of the Computer Society; ISSN 2644-1268; 2025-01-01; vol. 6, pp. 987-1000; doi:10.1109/OJCS.2025.3580498; article 11037824; “LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning”; Kailash Gogineni (https://orcid.org/0000-0003-1865-5470), Ali Suvizi (https://orcid.org/0000-0002-9338-6082), Guru Venkataramani (https://orcid.org/0000-0002-7084-7560), George Washington University, Washington, DC, USA; https://ieeexplore.ieee.org/document/11037824/; Artificial intelligence; large language models; fine-tuning; power efficiency |
| spellingShingle | Kailash Gogineni; Ali Suvizi; Guru Venkataramani; LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning; IEEE Open Journal of the Computer Society; Artificial intelligence; large language models; fine-tuning; power efficiency |
| title | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_full | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_fullStr | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_full_unstemmed | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_short | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_sort | llms on a budget system level approaches to power efficient and scalable fine tuning |
| topic | Artificial intelligence; large language models; fine-tuning; power efficiency |
| url | https://ieeexplore.ieee.org/document/11037824/ |
| work_keys_str_mv | AT kailashgogineni llmsonabudgetsystemlevelapproachestopowerefficientandscalablefinetuning AT alisuvizi llmsonabudgetsystemlevelapproachestopowerefficientandscalablefinetuning AT guruvenkataramani llmsonabudgetsystemlevelapproachestopowerefficientandscalablefinetuning |
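Low-rank adaptation, one of the techniques the abstract lists for reducing fine-tuning memory, replaces updates to a full weight matrix with two small trainable factors. The sketch below illustrates the parameter-count arithmetic behind that saving; the dimensions (`d`, `r`) and variable names are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of low-rank adaptation (LoRA): instead of training a full
# d x d weight matrix W, only two small factors are trained -- A (r x d) and
# B (d x r) -- and the effective weight becomes W + B @ A.
# d and r below are illustrative assumptions.
d, r = 512, 8

full_params = d * d            # trainable parameters under full fine-tuning
lora_params = r * d + d * r    # trainable parameters for the two LoRA factors

print(f"full fine-tuning: {full_params} trainable parameters")
print(f"LoRA (rank {r}): {lora_params} trainable parameters "
      f"({lora_params / full_params:.2%} of full)")
```

At rank 8 the adapter trains about 3% of the parameters a full update would touch, which is why LoRA combines well with the other memory optimizations the article evaluates.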
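The power-capping result in the abstract (a 33% power reduction for only a 5.55% throughput drop at a 300 W cap) implies a substantial energy-per-token saving. The sketch below works that implication out; the baseline power and throughput figures are illustrative assumptions, with only the two percentages taken from the abstract.

```python
# Energy-per-token comparison using the tradeoff reported in the abstract:
# capping the GPU to 300 W cut power by 33% at a 5.55% throughput cost.
def energy_per_token(power_watts, tokens_per_sec):
    """Joules consumed per generated token."""
    return power_watts / tokens_per_sec

baseline_power = 450.0   # assumed uncapped board power (illustrative)
baseline_tps = 100.0     # assumed uncapped throughput (illustrative)

capped_power = baseline_power * (1 - 0.33)   # 33% power reduction (abstract)
capped_tps = baseline_tps * (1 - 0.0555)     # 5.55% throughput drop (abstract)

savings = 1 - (energy_per_token(capped_power, capped_tps)
               / energy_per_token(baseline_power, baseline_tps))
print(f"energy per token reduced by {savings:.1%}")
```

On real hardware the cap itself would typically be applied with a tool such as `nvidia-smi --power-limit 300` (administrator privileges required); the arithmetic above is independent of how the limit is set.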