LLMs on a Budget: System-Level Approaches to Power-Efficient and Scalable Fine-Tuning

Large Language Models (LLMs) have shown remarkable capabilities in various applications, including robotics, telecommunications, and scientific discovery. While much attention has been given to the LLM inference and training phases, fine-tuning has received less focus despite its increasing cost, especially from a systems perspective. Fine-tuning is particularly important for customizing compact models for edge applications, such as personal assistants running on local devices and models personalized with user-specific data, which in turn calls for a deeper examination of fine-tuning performance and efficiency on single-GPU systems. Fine-tuning large models involves intensive matrix operations from backpropagation and gradient updates, which demand substantial power and memory. To explore the performance optimization opportunities available for improving LLM fine-tuning runtime, we examine the impact of techniques such as activation checkpointing, low-rank adaptation, and operation fusion. In addition, we explore resource utilization under GPU peak power capping. Our experiments, conducted on an NVIDIA RTX 4090 GPU using Meta’s LLaMA-3.1, Google’s Gemma, and Microsoft’s Phi-3, reveal that enabling all optimizations reduces memory usage by over 40% compared to FP32 baselines. Moreover, capping power at 300 W reduces power consumption by 33% with an average throughput drop of only 5.55%. Post-fine-tuning accuracy improvements on the Sycophancy Evaluation Benchmark range from 2% to 5%, depending on model architecture, validating that these optimization techniques preserve model quality while reducing resource requirements. Finally, we discuss several insights and potential future research directions from a systems perspective.
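This record does not include the authors’ training code, but the techniques named in the abstract map onto widely available tooling. Below is a minimal sketch, assuming a Hugging Face transformers/peft stack, of combining activation (gradient) checkpointing and low-rank adaptation on a single GPU, with the 300 W power cap applied through nvidia-smi; the model checkpoint, LoRA rank, target modules, and GPU index are illustrative assumptions, not settings reported in the paper.

# Minimal sketch (not the authors' code): LoRA + activation checkpointing
# for single-GPU fine-tuning, with an optional 300 W power cap.
import subprocess

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Optional: cap GPU board power at 300 W, the setting studied in the paper.
# Requires administrator privileges; GPU index 0 is an assumption.
subprocess.run(["nvidia-smi", "-i", "0", "-pl", "300"], check=False)

model_name = "meta-llama/Llama-3.1-8B"  # illustrative checkpoint choice
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # lower-precision weights vs. the FP32 baseline
    device_map="auto",
)

# Activation (gradient) checkpointing: recompute activations during the
# backward pass instead of storing them, trading compute for memory.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # KV caching is unnecessary during training

# Low-rank adaptation: train small rank-r adapter matrices instead of the
# full weights, shrinking gradient and optimizer state.
lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed subset of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters

Operation fusion (e.g., fused attention and optimizer kernels) and the fine-tuning loop itself are omitted here, since those details are specific to the authors’ setup.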


Bibliographic Details
Main Authors: Kailash Gogineni (ORCID: 0000-0003-1865-5470), Ali Suvizi (ORCID: 0000-0002-9338-6082), Guru Venkataramani (ORCID: 0000-0002-7084-7560), all of George Washington University, Washington, DC, USA
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Open Journal of the Computer Society, Vol. 6, pp. 987-1000
DOI: 10.1109/OJCS.2025.3580498
ISSN: 2644-1268
Collection: DOAJ
Subjects: Artificial intelligence; large language models; fine-tuning; power efficiency
Online Access: https://ieeexplore.ieee.org/document/11037824/