Entropy-Guided KV Caching for Efficient LLM Inference
| Main Authors: | Heekyum Kim, Yuchul Jung |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Mathematics |
| Subjects: | LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| Online Access: | https://www.mdpi.com/2227-7390/13/15/2366 |
| _version_ | 1849239520043597824 |
|---|---|
| author | Heekyum Kim; Yuchul Jung |
| author_facet | Heekyum Kim; Yuchul Jung |
| author_sort | Heekyum Kim |
| collection | DOAJ |
| description | Large language models (LLMs), built upon Transformer architectures, have demonstrated remarkable performance in a wide range of natural language processing tasks. However, their practical deployment—especially in long-context scenarios—is often hindered by the computational and memory costs associated with managing the key–value (KV) cache during inference. Optimizing this process is therefore crucial for improving LLM efficiency and scalability. In this study, we propose a novel entropy-guided KV caching strategy that leverages the distribution characteristics of attention scores within each Transformer layer. Specifically, we compute the entropy of the attention weights for each head and use the average entropy of all heads within a layer to assess the layer's contextual importance. Higher-entropy layers—those exhibiting broader attention dispersion—are allocated larger KV cache budgets, while lower-entropy (sink-like) layers are assigned smaller budgets. Instead of selecting different key–value tokens per head, our method selects a common set of important tokens per layer, based on aggregated attention scores, and caches them uniformly across all heads within the same layer. This design preserves the structural integrity of multi-head attention while enabling efficient token selection during the prefilling phase. The experimental results demonstrate that our approach improves cache utilization and inference speed without compromising generation quality. For example, on the Qwen3 4B model, our method reduces memory usage by 4.18% while preserving ROUGE scores, and on Mistral 7B v0.1, it reduces decoding time by 46.6%, highlighting entropy-guided layer analysis as a principled mechanism for scalable long-context language modeling. A minimal illustrative sketch of this layer-level budgeting appears after the record fields below. |
| format | Article |
| id | doaj-art-8b5f893e0d4642be98b40e1672c8e4d7 |
| institution | Kabale University |
| issn | 2227-7390 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Mathematics |
| spelling | doaj-art-8b5f893e0d4642be98b40e1672c8e4d7; 2025-08-20T04:00:55Z; eng; MDPI AG; Mathematics; 2227-7390; 2025-07-01; vol. 13, iss. 15, art. 2366; doi:10.3390/math13152366; Entropy-Guided KV Caching for Efficient LLM Inference; Heekyum Kim (Department of Computer Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea); Yuchul Jung (Department of AI Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea); https://www.mdpi.com/2227-7390/13/15/2366; keywords: LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| spellingShingle | Heekyum Kim; Yuchul Jung; Entropy-Guided KV Caching for Efficient LLM Inference; Mathematics; LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| title | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_full | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_fullStr | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_full_unstemmed | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_short | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_sort | entropy guided kv caching for efficient llm inference |
| topic | LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| url | https://www.mdpi.com/2227-7390/13/15/2366 |
| work_keys_str_mv | AT heekyumkim entropyguidedkvcachingforefficientllminference AT yuchuljung entropyguidedkvcachingforefficientllminference |
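The description above outlines the core algorithm: per-head attention entropy is averaged within each layer, higher-entropy layers receive larger KV cache budgets, and a single set of important tokens is selected per layer from aggregated attention scores during prefilling. The Python sketch below illustrates that idea only; the tensor shapes, the proportional budget rule, the minimum per-layer budget, and all function names are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Minimal sketch of entropy-guided per-layer KV cache budgeting (assumed shapes and rules).
import numpy as np

def attention_entropy_per_layer(attn, eps=1e-12):
    """attn: (num_heads, seq_len, seq_len) attention weights for one layer.
    Returns the mean (over heads and query positions) Shannon entropy."""
    p = np.clip(attn, eps, 1.0)
    ent = -(p * np.log(p)).sum(axis=-1)        # (num_heads, seq_len)
    return ent.mean()

def allocate_budgets(layer_entropies, total_budget, min_per_layer=16):
    """Split a total KV token budget across layers in proportion to entropy.
    Higher-entropy (dispersed-attention) layers get larger budgets."""
    ent = np.asarray(layer_entropies, dtype=np.float64)
    weights = ent / ent.sum()
    return np.maximum(min_per_layer, np.round(weights * total_budget)).astype(int)

def select_layer_tokens(attn, budget):
    """Pick one common token set per layer: aggregate the attention received by
    each key position over heads and queries, then keep the top-`budget` positions."""
    scores = attn.sum(axis=(0, 1))             # (seq_len,) aggregated importance
    keep = np.sort(np.argsort(scores)[::-1][:budget])
    return keep                                # cached uniformly across all heads

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_layers, num_heads, seq_len = 4, 8, 128
    # Toy prefill attention maps with rows normalized to sum to 1.
    attns = rng.random((num_layers, num_heads, seq_len, seq_len))
    attns /= attns.sum(axis=-1, keepdims=True)

    entropies = [attention_entropy_per_layer(a) for a in attns]
    budgets = allocate_budgets(entropies, total_budget=num_layers * 64)
    kept = [select_layer_tokens(a, b) for a, b in zip(attns, budgets)]
    for i, (e, b) in enumerate(zip(entropies, budgets)):
        print(f"layer {i}: mean entropy {e:.3f} -> budget {b} tokens, kept {len(kept[i])}")
```

Selecting one shared token set per layer, rather than a different set per head, keeps every head's cache aligned on the same positions, which is how the abstract's "structural integrity of multi-head attention" is preserved while the cache shrinks.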