Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU

Low-Rank Adaptation (LoRA) has garnered increasing attention as a way to fine-tune large language models (LLMs) effectively with limited resources. Nonetheless, conventional approaches that serve multiple LoRA models independently lead to redundant computation and suboptimal GPU utilization. This stud...
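
The abstract's point about redundant computation comes from running the shared base model once per LoRA model. A minimal sketch of the general idea behind batched multi-LoRA serving is shown below; the function name, tensor shapes, and adapter bookkeeping are illustrative assumptions, not the paper's actual operator-scheduling method.

```python
import torch

def batched_multi_lora_forward(x, W_base, lora_pairs, adapter_ids):
    """Illustrative sketch: share the base-model projection across tenants.

    x           : (batch, d_in) inputs from different tenants in one batch
    W_base      : (d_in, d_out) base-model weight shared by all adapters
    lora_pairs  : list of (A_i, B_i), A_i (d_in, r), B_i (r, d_out)
    adapter_ids : (batch,) index of the LoRA adapter used by each request
    """
    # Base projection computed once for the whole mixed batch,
    # instead of once per LoRA model served independently.
    y = x @ W_base

    # Low-rank correction applied per request with its own adapter.
    for i, (A, B) in enumerate(lora_pairs):
        mask = adapter_ids == i
        if mask.any():
            y[mask] += (x[mask] @ A) @ B
    return y
```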


Bibliographic Details
Main Authors: Lingnan Xia, Hua Ma
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10721583/