English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT

Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this p...

Bibliographic Details
Main Authors: Mai Alammar, Khalil El Hindi, Hend Al-Khalifa
Format: Article
Language:English
Published: MDPI AG 2025-06-01
Series:Computation
Subjects:
Online Access:https://www.mdpi.com/2079-3197/13/6/151
author Mai Alammar
Khalil El Hindi
Hend Al-Khalifa
author_facet Mai Alammar
Khalil El Hindi
Hend Al-Khalifa
author_sort Mai Alammar
collection DOAJ
description Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines the effectiveness of unsupervised semantic text chunking, based on the similarities between sentence embeddings, with pre-trained language models (PLMs), specifically BERT, by fine-tuning BERT on the semantic textual similarity (STS) task to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; we therefore created AraWiki50k, inspired by an existing English dataset, to evaluate our proposed method. Our experiments showed that exploiting BERT fine-tuned on STS improves results over unsupervised semantic chunking by an average of 7.4 in the Pk metric and 11.19 in the WindowDiff metric on four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset.
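The unsupervised step the abstract describes (placing chunk boundaries where the similarity between adjacent sentence embeddings falls) can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the `embed` callable, the cosine-similarity split rule, and the `threshold` value are all assumptions.

```python
import numpy as np

def chunk_by_similarity(sentences, embed, threshold=0.75):
    """Split a list of sentences into chunks wherever the cosine similarity
    between consecutive sentence embeddings drops below `threshold`.

    `embed` maps a sentence string to a fixed-size vector (e.g., from a
    fine-tuned BERT sentence encoder)."""
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if sim < threshold:          # semantic shift detected: close the chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

In practice the threshold is often set relative to the distribution of adjacent similarities (e.g., a percentile) rather than fixed, since raw cosine values vary by encoder and domain.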
format Article
id doaj-art-916511a37c7744ccbce19ea625e6a6db
institution OA Journals
issn 2079-3197
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Computation
spelling doaj-art-916511a37c7744ccbce19ea625e6a6db
2025-08-20T02:24:29Z
eng
MDPI AG
Computation
2079-3197
2025-06-01
Volume 13, Issue 6, Article 151
10.3390/computation13060151
English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
Mai Alammar; Khalil El Hindi; Hend Al-Khalifa
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia (all three authors)
https://www.mdpi.com/2079-3197/13/6/151
text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity
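The Pk (Beeferman et al.) and WindowDiff (Pevzner & Hearst) segmentation metrics the abstract reports can be sketched as below. Boundary sequences here are 0/1 lists where 1 marks a boundary after a sentence; by convention the window size `k` is set to half the mean reference segment length. This is a minimal reference implementation, not the paper's evaluation code.

```python
def p_k(ref, hyp, k):
    """Pk: fraction of windows of size k whose two ends fall in the same
    segment under one segmentation but different segments under the other.
    Lower is better."""
    same = lambda seg, i: sum(seg[i:i + k]) == 0  # no boundary inside window
    n = len(ref)
    return sum(same(ref, i) != same(hyp, i) for i in range(n - k)) / (n - k)

def window_diff(ref, hyp, k):
    """WindowDiff: fraction of windows of size k in which reference and
    hypothesis disagree on the number of boundaries. Lower is better."""
    count = lambda seg, i: sum(seg[i:i + k])
    n = len(ref)
    return sum(count(ref, i) != count(hyp, i) for i in range(n - k)) / (n - k)
```

WindowDiff was proposed as a correction to Pk: by comparing boundary counts rather than a same-segment predicate, it also penalizes near-miss boundaries that Pk can let pass.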
spellingShingle Mai Alammar
Khalil El Hindi
Hend Al-Khalifa
English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
Computation
text chunking
Arabic text chunking
semantic chunking
siamese network
BERT
semantic textual similarity
title English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_full English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_fullStr English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_full_unstemmed English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_short English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_sort english arabic hybrid semantic text chunking based on fine tuning bert
topic text chunking
Arabic text chunking
semantic chunking
siamese network
BERT
semantic textual similarity
url https://www.mdpi.com/2079-3197/13/6/151
work_keys_str_mv AT maialammar englisharabichybridsemantictextchunkingbasedonfinetuningbert
AT khalilelhindi englisharabichybridsemantictextchunkingbasedonfinetuningbert
AT hendalkhalifa englisharabichybridsemantictextchunkingbasedonfinetuningbert