Max–Min semantic chunking of documents for RAG application


Bibliographic Details
Main Authors: Csaba Kiss, Marcell Nagy, Péter Szilágyi
Format: Article
Language: English
Published: Springer 2025-06-01
Series: Discover Computing
Subjects: Natural language processing (NLP); Retrieval augmented generation (RAG); Document and text processing; Large language models (LLMs); Semantic chunking
Online Access: https://doi.org/10.1007/s10791-025-09638-7
collection DOAJ
description Abstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhancing large language model (LLM) outputs; however, their effectiveness depends heavily on the document chunking strategy. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method that uses semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance, with average AMI scores of 0.85 and 0.90 and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.
format Article
id doaj-art-42e1da45be7f4f76bc3fd4f5c329566b
institution DOAJ
issn 2948-2992
language English
publishDate 2025-06-01
publisher Springer
series Discover Computing
affiliations Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics (Csaba Kiss, Marcell Nagy); Nokia Bell Labs (Péter Szilágyi)
title Max–Min semantic chunking of documents for RAG application
topic Natural language processing (NLP)
Retrieval augmented generation (RAG)
Document and text processing
Large language models (LLMs)
Semantic chunking
url https://doi.org/10.1007/s10791-025-09638-7
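The abstract describes chunking documents by semantic similarity, using a Max–Min algorithm to place chunk boundaries. The record does not spell out the paper's actual algorithm, so the sketch below is only a hypothetical illustration of the general idea: embed adjacent sentences, score their similarity, and split wherever the similarity falls below a threshold placed between the global minimum and maximum similarity. The bag-of-words `embed` function, the `alpha` parameter, and the threshold rule are all illustrative assumptions, not the authors' method; a real pipeline would use a trained sentence encoder.

```python
from collections import Counter
import math

def embed(sentence):
    # Toy bag-of-words "embedding"; stands in for a sentence encoder.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def max_min_chunk(sentences, alpha=0.5):
    """Split where adjacent-sentence similarity drops below a threshold
    interpolated between the min and max similarity (a hypothetical
    reading of the Max-Min idea, not the paper's exact rule)."""
    if len(sentences) < 2:
        return [sentences]
    vecs = [embed(s) for s in sentences]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    lo, hi = min(sims), max(sims)
    threshold = lo + alpha * (hi - lo)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:          # weak link: start a new chunk here
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

sentences = [
    "RAG systems retrieve documents.",
    "Retrieved documents ground RAG answers.",
    "Cats sleep most of the day.",
    "Sleeping cats dream of cats.",
]
print(max_min_chunk(sentences))
```

On this toy input the similarity dips to zero between the two topics, so the text splits into two coherent chunks; tuning `alpha` trades chunk size against coherence.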