Max–Min semantic chunking of documents for RAG application
Abstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs; however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.
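The record does not include the paper's implementation, but the general idea of semantic chunking described in the abstract, grouping adjacent sentences by embedding similarity rather than by fixed size, can be sketched roughly as follows. This is an illustrative stand-in only: it uses a toy bag-of-words embedding and a simple adjacent-similarity break rule, not the authors' actual Max–Min criterion, and the `semantic_chunks` function and its `threshold` parameter are hypothetical names.

```python
import math
import re
from collections import Counter

def embed(sentence):
    # Toy bag-of-words embedding; a real system would use a sentence
    # embedding model (e.g. a transformer encoder) instead.
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def semantic_chunks(text, threshold=0.2):
    # Split the text into sentences, then start a new chunk wherever
    # the similarity between adjacent sentences drops below `threshold`.
    # The paper's Max-Min rule is not detailed in this record; this
    # fixed-threshold break is a generic placeholder.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = [cur]
        else:
            current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

On a short text whose topic shifts midway, the break falls at the shift, so each chunk stays topically coherent, which is the property the abstract's AMI evaluation measures.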
Saved in:
| Main Authors: | Csaba Kiss, Marcell Nagy, Péter Szilágyi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-06-01 |
| Series: | Discover Computing |
| Subjects: | Natural language processing (NLP), Retrieval augmented generation (RAG), Document and text processing, Large language models (LLMs), Semantic chunking |
| Online Access: | https://doi.org/10.1007/s10791-025-09638-7 |
| _version_ | 1850102491391197184 |
|---|---|
| author | Csaba Kiss Marcell Nagy Péter Szilágyi |
| author_facet | Csaba Kiss Marcell Nagy Péter Szilágyi |
| author_sort | Csaba Kiss |
| collection | DOAJ |
| description | Abstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs; however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant. |
| format | Article |
| id | doaj-art-42e1da45be7f4f76bc3fd4f5c329566b |
| institution | DOAJ |
| issn | 2948-2992 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Springer |
| record_format | Article |
| series | Discover Computing |
| spelling | doaj-art-42e1da45be7f4f76bc3fd4f5c329566b2025-08-20T02:39:44ZengSpringerDiscover Computing2948-29922025-06-0128111510.1007/s10791-025-09638-7Max–Min semantic chunking of documents for RAG applicationCsaba Kiss0Marcell Nagy1Péter Szilágyi2Department of Stochastics, Institute of Mathematics, Budapest University of Technology and EconomicsDepartment of Stochastics, Institute of Mathematics, Budapest University of Technology and EconomicsNokia Bell LabsAbstract Retrieval-augmented generation (RAG) systems have emerged as a powerful approach to enhance large language model (LLM) outputs, however, their effectiveness heavily depends on document chunking strategies. Current methods, often arbitrary or size-based segmentation, fail to preserve semantic coherence, leading to suboptimal retrieval and reduced output quality. To overcome this limitation, we introduce Max–Min semantic chunking, a novel method utilizing semantic similarity and a Max–Min algorithm to identify semantically coherent text. We evaluated our approach on three distinct datasets, assessing clustering efficiency via adjusted mutual information (AMI) and generation coherence through accuracy on a RAG-based multiple-choice question answering test. Across the datasets, Max–Min semantic chunking achieved superior performance with average AMI scores of 0.85, 0.90, and an average accuracy of 0.56 (averaged across LLMs). This significantly outperformed the next best method, the Llama Semantic Splitter (AMI: 0.68, 0.70; accuracy: 0.53). The improvements in the AMI scores were statistically significant.https://doi.org/10.1007/s10791-025-09638-7Natural language processing (NLP)Retrieval augmented generation (RAG)Document and text processingLarge language models (LLMs)Semantic chunking |
| spellingShingle | Csaba Kiss Marcell Nagy Péter Szilágyi Max–Min semantic chunking of documents for RAG application Discover Computing Natural language processing (NLP) Retrieval augmented generation (RAG) Document and text processing Large language models (LLMs) Semantic chunking |
| title | Max–Min semantic chunking of documents for RAG application |
| title_full | Max–Min semantic chunking of documents for RAG application |
| title_fullStr | Max–Min semantic chunking of documents for RAG application |
| title_full_unstemmed | Max–Min semantic chunking of documents for RAG application |
| title_short | Max–Min semantic chunking of documents for RAG application |
| title_sort | max min semantic chunking of documents for rag application |
| topic | Natural language processing (NLP) Retrieval augmented generation (RAG) Document and text processing Large language models (LLMs) Semantic chunking |
| url | https://doi.org/10.1007/s10791-025-09638-7 |
| work_keys_str_mv | AT csabakiss maxminsemanticchunkingofdocumentsforragapplication AT marcellnagy maxminsemanticchunkingofdocumentsforragapplication AT peterszilagyi maxminsemanticchunkingofdocumentsforragapplication |