Entropy-Guided KV Caching for Efficient LLM Inference
| Main Authors: | Heekyum Kim, Yuchul Jung |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Mathematics |
| Subjects: | LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| Online Access: | https://www.mdpi.com/2227-7390/13/15/2366 |
| _version_ | 1849239520043597824 |
|---|---|
| author | Heekyum Kim; Yuchul Jung |
| author_facet | Heekyum Kim; Yuchul Jung |
| author_sort | Heekyum Kim |
| collection | DOAJ |
| description | Large language models (LLMs), built upon Transformer architectures, have demonstrated remarkable performance in a wide range of natural language processing tasks. However, their practical deployment—especially in long-context scenarios—is often hindered by the computational and memory costs associated with managing the key–value (KV) cache during inference. Optimizing this process is therefore crucial for improving LLM efficiency and scalability. In this study, we propose a novel entropy-guided KV caching strategy that leverages the distribution characteristics of attention scores within each Transformer layer. Specifically, we compute the entropy of the attention weights for each head and use the average entropy of all heads within a layer to assess the layer's contextual importance. Higher-entropy layers—those exhibiting broader attention dispersion—are allocated larger KV cache budgets, while lower-entropy (sink-like) layers are assigned smaller budgets. Instead of selecting different key–value tokens per head, our method selects a common set of important tokens per layer, based on aggregated attention scores, and caches them uniformly across all heads within the same layer. This design preserves the structural integrity of multi-head attention while enabling efficient token selection during the prefilling phase. The experimental results demonstrate that our approach improves cache utilization and inference speed without compromising generation quality. For example, on the Qwen3 4B model, our method reduces memory usage by 4.18% while preserving ROUGE scores, and on Mistral 7B v0.1, it reduces decoding time by 46.6%, highlighting entropy-guided layer analysis as a principled mechanism for scalable long-context language modeling. A minimal illustrative sketch of this layer-level budgeting appears after the record fields below. |
| format | Article |
| id | doaj-art-8b5f893e0d4642be98b40e1672c8e4d7 |
| institution | Kabale University |
| issn | 2227-7390 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Mathematics |
| spelling | doaj-art-8b5f893e0d4642be98b40e1672c8e4d7; 2025-08-20T04:00:55Z; eng; MDPI AG; Mathematics; 2227-7390; 2025-07-01; vol. 13, iss. 15, art. 2366; doi:10.3390/math13152366; Entropy-Guided KV Caching for Efficient LLM Inference; Heekyum Kim (Department of Computer Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea); Yuchul Jung (Department of AI Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea); https://www.mdpi.com/2227-7390/13/15/2366; keywords: LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| spellingShingle | Heekyum Kim; Yuchul Jung; Entropy-Guided KV Caching for Efficient LLM Inference; Mathematics; LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| title | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_full | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_fullStr | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_full_unstemmed | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_short | Entropy-Guided KV Caching for Efficient LLM Inference |
| title_sort | entropy guided kv caching for efficient llm inference |
| topic | LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching |
| url | https://www.mdpi.com/2227-7390/13/15/2366 |
| work_keys_str_mv | AT heekyumkim entropyguidedkvcachingforefficientllminference AT yuchuljung entropyguidedkvcachingforefficientllminference |
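The description above outlines the core algorithm: per-head attention entropy is averaged within each layer, higher-entropy layers receive larger KV cache budgets, and a single set of important tokens is selected per layer from aggregated attention scores during prefilling. The Python sketch below illustrates that idea only; the tensor shapes, the proportional budget rule, the minimum per-layer budget, and all function names are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Minimal sketch of entropy-guided per-layer KV cache budgeting (assumed shapes and rules).
import numpy as np

def attention_entropy_per_layer(attn, eps=1e-12):
    """attn: (num_heads, seq_len, seq_len) attention weights for one layer.
    Returns the mean (over heads and query positions) Shannon entropy."""
    p = np.clip(attn, eps, 1.0)
    ent = -(p * np.log(p)).sum(axis=-1)        # (num_heads, seq_len)
    return ent.mean()

def allocate_budgets(layer_entropies, total_budget, min_per_layer=16):
    """Split a total KV token budget across layers in proportion to entropy.
    Higher-entropy (dispersed-attention) layers get larger budgets."""
    ent = np.asarray(layer_entropies, dtype=np.float64)
    weights = ent / ent.sum()
    return np.maximum(min_per_layer, np.round(weights * total_budget)).astype(int)

def select_layer_tokens(attn, budget):
    """Pick one common token set per layer: aggregate the attention received by
    each key position over heads and queries, then keep the top-`budget` positions."""
    scores = attn.sum(axis=(0, 1))             # (seq_len,) aggregated importance
    keep = np.sort(np.argsort(scores)[::-1][:budget])
    return keep                                # cached uniformly across all heads

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_layers, num_heads, seq_len = 4, 8, 128
    # Toy prefill attention maps with rows normalized to sum to 1.
    attns = rng.random((num_layers, num_heads, seq_len, seq_len))
    attns /= attns.sum(axis=-1, keepdims=True)

    entropies = [attention_entropy_per_layer(a) for a in attns]
    budgets = allocate_budgets(entropies, total_budget=num_layers * 64)
    kept = [select_layer_tokens(a, b) for a, b in zip(attns, budgets)]
    for i, (e, b) in enumerate(zip(entropies, budgets)):
        print(f"layer {i}: mean entropy {e:.3f} -> budget {b} tokens, kept {len(kept[i])}")
```

Selecting one shared token set per layer, rather than a different set per head, keeps every head's cache aligned on the same positions, which is how the abstract's "structural integrity of multi-head attention" is preserved while the cache shrinks.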