Max–Min semantic chunking of documents for RAG application
Abstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs; however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.
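The record does not include the paper's implementation, but the general idea of semantic chunking described in the abstract, grouping adjacent sentences by embedding similarity rather than by fixed size, can be sketched roughly as follows. This is an illustrative stand-in only: it uses a toy bag-of-words embedding and a simple adjacent-similarity break rule, not the authors' actual Max–Min criterion, and the `semantic_chunks` function and its `threshold` parameter are hypothetical names.

```python
import math
import re
from collections import Counter

def embed(sentence):
    # Toy bag-of-words embedding; a real system would use a sentence
    # embedding model (e.g. a transformer encoder) instead.
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def semantic_chunks(text, threshold=0.2):
    # Split the text into sentences, then start a new chunk wherever
    # the similarity between adjacent sentences drops below `threshold`.
    # The paper's Max-Min rule is not detailed in this record; this
    # fixed-threshold break is a generic placeholder.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = [cur]
        else:
            current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

On a short text whose topic shifts midway, the break falls at the shift, so each chunk stays topically coherent, which is the property the abstract's AMI evaluation measures.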
Saved in:
| Main Authors: | Csaba Kiss, Marcell Nagy, Péter Szilágyi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-06-01 |
| Series: | Discover Computing |
| Subjects: | Natural language processing (NLP), Retrieval augmented generation (RAG), Document and text processing, Large language models (LLMs), Semantic chunking |
| Online Access: | https://doi.org/10.1007/s10791-025-09638-7 |
| _version_ | 1850102491391197184 |
|---|---|
| author | Csaba Kiss Marcell Nagy Péter Szilágyi |
| author_facet | Csaba Kiss Marcell Nagy Péter Szilágyi |
| author_sort | Csaba Kiss |
| collection | DOAJ |
| description | Abstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs; however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant. |
| format | Article |
| id | doaj-art-42e1da45be7f4f76bc3fd4f5c329566b |
| institution | DOAJ |
| issn | 2948-2992 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Springer |
| record_format | Article |
| series | Discover Computing |
| spelling | doaj-art-42e1da45be7f4f76bc3fd4f5c329566b2025-08-20T02:39:44ZengSpringerDiscover Computing2948-29922025-06-0128111510.1007/s10791-025-09638-7Max–Min semantic chunking of documents for RAG applicationCsaba Kiss0Marcell Nagy1Péter Szilágyi2Department of Stochastics, Institute of Mathematics, Budapest University of Technology and EconomicsDepartment of Stochastics, Institute of Mathematics, Budapest University of Technology and EconomicsNokia Bell LabsAbstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs, however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.https://doi.org/10.1007/s10791-025-09638-7Natural language processing (NLP)Retrieval augmented generation (RAG)Document and text processingLarge language models (LLMs)Semantic chunking |
| spellingShingle | Csaba Kiss Marcell Nagy Péter Szilágyi Max–Min semantic chunking of documents for RAG application Discover Computing Natural language processing (NLP) Retrieval augmented generation (RAG) Document and text processing Large language models (LLMs) Semantic chunking |
| title | Max–Min semantic chunking of documents for RAG application |
| title_full | Max–Min semantic chunking of documents for RAG application |
| title_fullStr | Max–Min semantic chunking of documents for RAG application |
| title_full_unstemmed | Max–Min semantic chunking of documents for RAG application |
| title_short | Max–Min semantic chunking of documents for RAG application |
| title_sort | max min semantic chunking of documents for rag application |
| topic | Natural language processing (NLP) Retrieval augmented generation (RAG) Document and text processing Large language models (LLMs) Semantic chunking |
| url | https://doi.org/10.1007/s10791-025-09638-7 |
| work_keys_str_mv | AT csabakiss maxminsemanticchunkingofdocumentsforragapplication AT marcellnagy maxminsemanticchunkingofdocumentsforragapplication AT peterszilagyi maxminsemanticchunkingofdocumentsforragapplication |