Long-context inference optimization for large language models: a survey

Bibliographic Details
Main Authors: TAO Wei, WANG Jianzong, ZHANG Xulong, QU Xiaoyang
Format: Article
Language: Chinese
Published: China InfoCom Media Group 2025-01-01
Series: 大数据 (Big Data)
Online Access: http://www.j-bigdataresearch.com.cn/thesisDetails#10.11959/j.issn.2096-0271.2024xxx
Description
Summary: With the rapid development of large language model (LLM) technology, the demand for processing long-text inputs keeps increasing, yet long-text inference faces challenges such as high memory consumption and latency. To improve the efficiency of LLMs in long-text inference, a comprehensive review and analysis of existing optimization techniques was conducted. The study first identified three key factors that constrain efficiency: the huge model size, the attention mechanism whose computational complexity is quadratic in sequence length, and the autoregressive decoding strategy. Together, these factors limit overall model performance. A taxonomy was then proposed that categorizes optimization techniques into model optimization, computation optimization, and system optimization, with detailed introductions to key technologies such as quantization, sparse attention, and operator fusion. The results demonstrate that these techniques can effectively improve long-text inference performance. Finally, future research directions were outlined, emphasizing the importance of further optimizing LLMs for long-text inference to meet the growing demand for longer context lengths.
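To make the quadratic-complexity factor concrete, below is a minimal sliding-window sparse-attention sketch in Python/NumPy. The function name, window size, and tensor shapes are illustrative assumptions for exposition, not the survey's own implementation.

import numpy as np

def sliding_window_attention(Q, K, V, window=64):
    # Minimal sparse-attention sketch: each query attends only to the
    # previous `window` key/value positions (causal local attention),
    # so the score matrix shrinks from n*n entries to roughly n*window.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = (Q[i] @ K[lo:i + 1].T) * scale   # local attention scores
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ V[lo:i + 1]            # weighted sum of local values
    return out

# Toy usage: 1,024 tokens with 32-dim heads. Full attention would compute
# 1,024 x 1,024 scores; the windowed variant computes about 1,024 x 64.
rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 32))
print(sliding_window_attention(x, x, x).shape)  # (1024, 32)

Restricting each query to a fixed local window is one common sparse-attention pattern: the cost grows linearly rather than quadratically with sequence length, which is precisely the trade-off the surveyed computation-optimization techniques exploit.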
ISSN:2096-0271