Entropy-Guided KV Caching for Efficient LLM Inference

Bibliographic Details
Main Authors: Heekyum Kim, Yuchul Jung
Format: Article
Language: English
Published: MDPI AG, 2025-07-01
Series: Mathematics
Subjects: LLM, KV cache, transformer, LLM inference optimization, attention entropy, memory-efficient caching
Online Access: https://www.mdpi.com/2227-7390/13/15/2366
author Heekyum Kim
Yuchul Jung
collection DOAJ
description Large language models (LLMs), built upon Transformer architectures, have demonstrated remarkable performance in a wide range of natural language processing tasks. However, their practical deployment—especially in long-context scenarios—is often hindered by the computational and memory costs associated with managing the key–value (KV) cache during inference. Optimizing this process is therefore crucial for improving LLM efficiency and scalability. In this study, we propose a novel entropy-guided KV caching strategy that leverages the distribution characteristics of attention scores within each Transformer layer. Specifically, we compute the entropy of attention weights for each head and use the average entropy of all heads within a layer to assess the layer’s contextual importance. Higher-entropy layers—those exhibiting broader attention dispersion—are allocated larger KV cache budgets, while lower-entropy (sink-like) layers are assigned smaller budgets. Instead of selecting different key–value tokens per head, our method selects a common set of important tokens per layer, based on aggregated attention scores, and caches them uniformly across all heads within the same layer. This design preserves the structural integrity of multi-head attention while enabling efficient token selection during the prefilling phase. The experimental results demonstrate that our approach improves cache utilization and inference speed without compromising generation quality. For example, on the Qwen3 4B model, our method reduces memory usage by 4.18% while preserving the ROUGE score, and on Mistral 7B v0.1, it reduces decoding time by 46.6%, highlighting entropy-guided layer analysis as a principled mechanism for scalable long-context language modeling.
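The abstract above outlines the core computation. Below is a minimal sketch, not code from the paper: the function names, tensor shapes, and the proportional-allocation rule are assumptions made here purely to illustrate how per-layer attention entropy could be turned into KV-cache budgets and a shared per-layer token set during prefilling.

import torch

def layer_attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> float:
    # attn: softmax attention probabilities for one layer, shape (num_heads, q_len, k_len)
    ent_per_head = -(attn * (attn + eps).log()).sum(dim=-1)   # entropy at each (head, query) position
    return ent_per_head.mean().item()                         # average over heads (and query positions)

def allocate_layer_budgets(entropies: list[float], total_budget: int) -> list[int]:
    # Higher-entropy layers receive proportionally larger shares of the total KV-cache budget.
    weights = torch.tensor(entropies)
    weights = weights / weights.sum()
    return [max(1, int(round(w.item() * total_budget))) for w in weights]

def select_layer_tokens(attn: torch.Tensor, budget: int) -> torch.Tensor:
    # One common token set per layer: aggregate attention over heads and query positions,
    # then keep the top-`budget` key positions; the same indices are cached for every head.
    scores = attn.sum(dim=(0, 1))                              # shape (k_len,)
    return torch.topk(scores, min(budget, scores.numel())).indices

In such a pipeline, the prefilling attention maps of each layer would feed layer_attention_entropy, the resulting per-layer entropies would set budgets via allocate_layer_budgets, and select_layer_tokens would pick which key–value entries to retain for that layer.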
format Article
id doaj-art-8b5f893e0d4642be98b40e1672c8e4d7
institution Kabale University
issn 2227-7390
language English
publishDate 2025-07-01
publisher MDPI AG
record_format Article
series Mathematics
spelling Mathematics, Vol. 13, Iss. 15, Art. 2366 (2025-07-01); DOI: 10.3390/math13152366
Heekyum Kim: Department of Computer Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea
Yuchul Jung: Department of AI Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea
title Entropy-Guided KV Caching for Efficient LLM Inference
topic LLM
KV cache
transformer
LLM inference optimization
attention entropy
memory-efficient caching
url https://www.mdpi.com/2227-7390/13/15/2366