LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning
| Main Authors: | Kailash Gogineni, Ali Suvizi, Guru Venkataramani |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Open Journal of the Computer Society |
| Subjects: | Artificial intelligence; large language models; fine-tuning; power efficiency |
| Online Access: | https://ieeexplore.ieee.org/document/11037824/ |
| _version_ | 1849428433695670272 |
|---|---|
| author | Kailash Gogineni; Ali Suvizi; Guru Venkataramani |
| author_facet | Kailash Gogineni; Ali Suvizi; Guru Venkataramani |
| author_sort | Kailash Gogineni |
| collection | DOAJ |
| description | Large Language Models (LLMs) have shown remarkable capabilities across applications including robotics, telecommunications, and scientific discovery. While much attention has been given to LLM inference and training, fine-tuning has received less focus despite its growing cost, especially from a systems perspective. Fine-tuning is particularly important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn requires a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which demand substantial power and memory. To explore the performance optimization opportunities available for improving LLM fine-tuning runtime, we analyze the impact of techniques such as activation checkpointing, low-rank adaptation, and operation fusion. In addition, we explore the effects of constraining resource utilization through GPU peak power capping. Our experiments, conducted on an NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, capping power to 300 W results in an average throughput drop of only 5.55% while reducing power consumption by 33%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that these optimization techniques preserve model quality while reducing resource requirements. Furthermore, we discuss several insights and potential future research directions from a systems perspective. |
| format | Article |
| id | doaj-art-4ce272f407554d32a49644ce998a1bea |
| institution | Kabale University |
| issn | 2644-1268 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Open Journal of the Computer Society |
| spelling | doaj-art-4ce272f407554d32a49644ce998a1bea 2025-08-20T03:28:43Z; eng; IEEE; IEEE Open Journal of the Computer Society; ISSN 2644-1268; 2025-01-01; vol. 6, pp. 987-1000; doi:10.1109/OJCS.2025.3580498; article 11037824; “LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning”; Kailash Gogineni (https://orcid.org/0000-0003-1865-5470), Ali Suvizi (https://orcid.org/0000-0002-9338-6082), Guru Venkataramani (https://orcid.org/0000-0002-7084-7560), George Washington University, Washington, DC, USA; https://ieeexplore.ieee.org/document/11037824/; Artificial intelligence; large language models; fine-tuning; power efficiency |
| spellingShingle | Kailash Gogineni; Ali Suvizi; Guru Venkataramani; LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning; IEEE Open Journal of the Computer Society; Artificial intelligence; large language models; fine-tuning; power efficiency |
| title | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_full | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_fullStr | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_full_unstemmed | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_short | LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning |
| title_sort | llms on a budget system level approaches to power efficient and scalable fine tuning |
| topic | Artificial intelligence; large language models; fine-tuning; power efficiency |
| url | https://ieeexplore.ieee.org/document/11037824/ |
| work_keys_str_mv | AT kailashgogineni llmsonabudgetsystemlevelapproachestopowerefficientandscalablefinetuning AT alisuvizi llmsonabudgetsystemlevelapproachestopowerefficientandscalablefinetuning AT guruvenkataramani llmsonabudgetsystemlevelapproachestopowerefficientandscalablefinetuning |
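Low-rank adaptation, one of the techniques the abstract lists for reducing fine-tuning memory, replaces updates to a full weight matrix with two small trainable factors. The sketch below illustrates the parameter-count arithmetic behind that saving; the dimensions (`d`, `r`) and variable names are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of low-rank adaptation (LoRA): instead of training a full
# d x d weight matrix W, only two small factors are trained -- A (r x d) and
# B (d x r) -- and the effective weight becomes W + B @ A.
# d and r below are illustrative assumptions.
d, r = 512, 8

full_params = d * d            # trainable parameters under full fine-tuning
lora_params = r * d + d * r    # trainable parameters for the two LoRA factors

print(f"full fine-tuning: {full_params} trainable parameters")
print(f"LoRA (rank {r}): {lora_params} trainable parameters "
      f"({lora_params / full_params:.2%} of full)")
```

At rank 8 the adapter trains about 3% of the parameters a full update would touch, which is why LoRA combines well with the other memory optimizations the article evaluates.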
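The power-capping result in the abstract (a 33% power reduction for only a 5.55% throughput drop at a 300 W cap) implies a substantial energy-per-token saving. The sketch below works that implication out; the baseline power and throughput figures are illustrative assumptions, with only the two percentages taken from the abstract.

```python
# Energy-per-token comparison using the tradeoff reported in the abstract:
# capping the GPU to 300 W cut power by 33% at a 5.55% throughput cost.
def energy_per_token(power_watts, tokens_per_sec):
    """Joules consumed per generated token."""
    return power_watts / tokens_per_sec

baseline_power = 450.0   # assumed uncapped board power (illustrative)
baseline_tps = 100.0     # assumed uncapped throughput (illustrative)

capped_power = baseline_power * (1 - 0.33)   # 33% power reduction (abstract)
capped_tps = baseline_tps * (1 - 0.0555)     # 5.55% throughput drop (abstract)

savings = 1 - (energy_per_token(capped_power, capped_tps)
               / energy_per_token(baseline_power, baseline_tps))
print(f"energy per token reduced by {savings:.1%}")
```

On real hardware the cap itself would typically be applied with a tool such as `nvidia-smi --power-limit 300` (administrator privileges required); the arithmetic above is independent of how the limit is set.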