Long-context inference optimization for large language models: a survey
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | Chinese |
| Published: | China InfoCom Media Group, 2025-01-01 |
| Series: | 大数据 (Big Data) |
| Subjects: | |
| Online Access: | http://www.j-bigdataresearch.com.cn/thesisDetails#10.11959/j.issn.2096-0271.2024xxx |
| Summary: | With the rapid development of large language model (LLM) technology, the demand for processing long-text inputs has been increasing. However, long-text inference faces challenges such as high memory consumption and latency. To improve the efficiency of LLMs in long-text inference, a comprehensive review and analysis of existing optimization techniques was conducted. The study first identified three key factors that limit efficiency: the huge model size, the quadratic computational complexity of the attention mechanism, and the autoregressive decoding strategy. Together, these factors restrict overall model performance. A taxonomy was then proposed that categorizes optimization techniques into model optimization, computation optimization, and system optimization, with detailed introductions to key technologies such as quantization, sparse attention, and operator fusion (a sketch of sparse attention follows the record). The results demonstrate that these optimization techniques can effectively enhance long-text inference performance. Finally, future research directions were outlined, emphasizing the importance of further optimizing LLMs for long-text inference to meet the growing demands of context length. |
| ISSN: | 2096-0271 |
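
The summary names sparse attention as a key computation-optimization technique for avoiding the attention mechanism's quadratic cost. The sketch below illustrates the idea under simplifying assumptions: a single head, plain NumPy, and a causal sliding-window pattern. The function names, window size, and tensor shapes are illustrative choices, not taken from the survey or any specific system.

```python
# A minimal sketch of sliding-window sparse attention, one of the
# long-context techniques the survey files under computation optimization.
# All names and shapes here are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Dense attention: the (n, n) score matrix is what makes the
    # cost quadratic in sequence length n.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (n, n)
    return softmax(scores) @ v

def sliding_window_attention(q, k, v, window=4):
    # Each query attends only to the `window` most recent keys
    # (causal), so cost grows as O(n * window) instead of O(n^2).
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)   # (<= window,)
        out[i] = softmax(scores) @ v[lo:i + 1]
    return out

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(3, n, d))
dense = full_attention(q, k, v)
sparse = sliding_window_attention(q, k, v, window=4)
print("output shapes:", dense.shape, sparse.shape)   # (16, 8) twice
```

The design point the sketch makes is that sparsity trades global context for a per-query cost bound: each output row touches at most `window` key/value pairs, which is what makes long sequences tractable in memory and latency terms.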