CacheFormer: High-Attention-Based Segment Caching
Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches, such as Linformer, Longformer, Performer, and structured state space models (SSMs), have not fully resolved this problem. All of these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the quality loss incurred by compressing the long context. Inspired by the cache and virtual-memory principle in computers, where a cache miss retrieves not only the needed data from memory but also the adjacent data, we apply this concept to long contexts by dividing them into small segments. In our design, when high segment-level attention occurs at the compressed level, we retrieve the nearby segments in uncompressed form. Our enhancements for handling long context aggregate four attention mechanisms: short sliding-window attention, long compressed segmented attention, dynamic retrieval of the top-<i>k</i> high-attention uncompressed segments, and overlapping segments in long segment attention to avoid segment fragmentation. These enhancements yield an architecture that outperforms existing SOTA architectures, with an average perplexity improvement of 8.5% over similar model sizes.
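The abstract outlines a cache-inspired retrieval step: when a compressed segment draws high attention, that segment and its immediate neighbors are fetched in uncompressed form, much like a cache fetching adjacent lines on a miss. A minimal illustrative sketch of that top-<i>k</i>-with-neighbors index selection (the function name, scores, and shapes here are assumptions for illustration, not the authors' code):

```python
import numpy as np

def retrieve_segments(seg_attn, k, num_segments):
    """Return indices of the k highest-attention segments plus their
    immediate neighbors, mimicking a cache fetching adjacent lines."""
    # Rank segments by their compressed-level attention score, highest first.
    top = np.argsort(seg_attn)[::-1][:k]
    fetched = set()
    for s in top:
        # Fetch the hit segment and its neighbors (cache-line analogy).
        for n in (s - 1, s, s + 1):
            if 0 <= n < num_segments:
                fetched.add(int(n))
    return sorted(fetched)

# Example: 8 compressed segments; segments 1 and 4 attract the most attention.
scores = np.array([0.10, 0.90, 0.20, 0.05, 0.70, 0.00, 0.30, 0.10])
print(retrieve_segments(scores, k=2, num_segments=len(scores)))  # [0, 1, 2, 3, 4, 5]
```

In the full architecture, the gathered uncompressed segments would then feed an additional attention pass; this sketch covers only the index selection.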
Saved in:
| Main Authors: | Sushant Singh, Ausif Mahmood |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | AI |
| Subjects: | deep learning; natural language processing (NLP); large language models (LLMs); long-range modeling in LLMs |
| Online Access: | https://www.mdpi.com/2673-2688/6/4/85 |
| _version_ | 1850155962350960640 |
|---|---|
| author | Sushant Singh; Ausif Mahmood |
| author_facet | Sushant Singh; Ausif Mahmood |
| author_sort | Sushant Singh |
| collection | DOAJ |
| description | Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches, such as Linformer, Longformer, Performer, and structured state space models (SSMs), have not fully resolved this problem. All of these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the quality loss incurred by compressing the long context. Inspired by the cache and virtual-memory principle in computers, where a cache miss retrieves not only the needed data from memory but also the adjacent data, we apply this concept to long contexts by dividing them into small segments. In our design, when high segment-level attention occurs at the compressed level, we retrieve the nearby segments in uncompressed form. Our enhancements for handling long context aggregate four attention mechanisms: short sliding-window attention, long compressed segmented attention, dynamic retrieval of the top-<i>k</i> high-attention uncompressed segments, and overlapping segments in long segment attention to avoid segment fragmentation. These enhancements yield an architecture that outperforms existing SOTA architectures, with an average perplexity improvement of 8.5% over similar model sizes. |
| format | Article |
| id | doaj-art-83a471d26ffe4835abd41ee7baaee9ed |
| institution | OA Journals |
| issn | 2673-2688 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | AI |
| spelling | doaj-art-83a471d26ffe4835abd41ee7baaee9ed; 2025-08-20T02:24:43Z; eng; MDPI AG; AI; 2673-2688; 2025-04-01; vol. 6, no. 4, art. 85; 10.3390/ai6040085; CacheFormer: High-Attention-Based Segment Caching; Sushant Singh (Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA); Ausif Mahmood (Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA); [abstract as in description field]; https://www.mdpi.com/2673-2688/6/4/85; deep learning; natural language processing (NLP); large language models (LLMs); long-range modeling in LLMs |
| spellingShingle | Sushant Singh; Ausif Mahmood; CacheFormer: High-Attention-Based Segment Caching; AI; deep learning; natural language processing (NLP); large language models (LLMs); long-range modeling in LLMs |
| title | CacheFormer: High-Attention-Based Segment Caching |
| title_full | CacheFormer: High-Attention-Based Segment Caching |
| title_fullStr | CacheFormer: High-Attention-Based Segment Caching |
| title_full_unstemmed | CacheFormer: High-Attention-Based Segment Caching |
| title_short | CacheFormer: High-Attention-Based Segment Caching |
| title_sort | cacheformer high attention based segment caching |
| topic | deep learning; natural language processing (NLP); large language models (LLMs); long-range modeling in LLMs |
| url | https://www.mdpi.com/2673-2688/6/4/85 |
| work_keys_str_mv | AT sushantsingh cacheformerhighattentionbasedsegmentcaching AT ausifmahmood cacheformerhighattentionbasedsegmentcaching |