English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT

Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this p...

Bibliographic Details
Main Authors: Mai Alammar, Khalil El Hindi, Hend Al-Khalifa
Format: Article
Language:English
Published: MDPI AG 2025-06-01
Series:Computation
Subjects:
Online Access:https://www.mdpi.com/2079-3197/13/6/151
author Mai Alammar
Khalil El Hindi
Hend Al-Khalifa
author_facet Mai Alammar
Khalil El Hindi
Hend Al-Khalifa
author_sort Mai Alammar
collection DOAJ
description Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines the effectiveness of unsupervised semantic text chunking, based on the similarities between sentence embeddings, with pre-trained language models (PLMs), specifically BERT, by fine-tuning BERT on the semantic textual similarity (STS) task to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; we therefore created AraWiki50k, inspired by an existing English dataset, to evaluate our proposed method. Our experiments showed that exploiting BERT fine-tuned on STS improves results over unsupervised semantic chunking by an average of 7.4 in the Pk metric and 11.19 in the WindowDiff metric on four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset.
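The unsupervised step the abstract describes (placing chunk boundaries where the similarity between adjacent sentence embeddings falls) can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the `embed` callable, the cosine-similarity split rule, and the `threshold` value are all assumptions.

```python
import numpy as np

def chunk_by_similarity(sentences, embed, threshold=0.75):
    """Split a list of sentences into chunks wherever the cosine similarity
    between consecutive sentence embeddings drops below `threshold`.

    `embed` maps a sentence string to a fixed-size vector (e.g., from a
    fine-tuned BERT sentence encoder)."""
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if sim < threshold:          # semantic shift detected: close the chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

In practice the threshold is often set relative to the distribution of adjacent similarities (e.g., a percentile) rather than fixed, since raw cosine values vary by encoder and domain.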
format Article
id doaj-art-916511a37c7744ccbce19ea625e6a6db
institution OA Journals
issn 2079-3197
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Computation
spelling doaj-art-916511a37c7744ccbce19ea625e6a6db
2025-08-20T02:24:29Z
eng
MDPI AG
Computation
2079-3197
2025-06-01
Volume 13, Issue 6, Article 151
10.3390/computation13060151
English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
Mai Alammar; Khalil El Hindi; Hend Al-Khalifa
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia (all three authors)
https://www.mdpi.com/2079-3197/13/6/151
text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity
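The Pk (Beeferman et al.) and WindowDiff (Pevzner & Hearst) segmentation metrics the abstract reports can be sketched as below. Boundary sequences here are 0/1 lists where 1 marks a boundary after a sentence; by convention the window size `k` is set to half the mean reference segment length. This is a minimal reference implementation, not the paper's evaluation code.

```python
def p_k(ref, hyp, k):
    """Pk: fraction of windows of size k whose two ends fall in the same
    segment under one segmentation but different segments under the other.
    Lower is better."""
    same = lambda seg, i: sum(seg[i:i + k]) == 0  # no boundary inside window
    n = len(ref)
    return sum(same(ref, i) != same(hyp, i) for i in range(n - k)) / (n - k)

def window_diff(ref, hyp, k):
    """WindowDiff: fraction of windows of size k in which reference and
    hypothesis disagree on the number of boundaries. Lower is better."""
    count = lambda seg, i: sum(seg[i:i + k])
    n = len(ref)
    return sum(count(ref, i) != count(hyp, i) for i in range(n - k)) / (n - k)
```

WindowDiff was proposed as a correction to Pk: by comparing boundary counts rather than a same-segment predicate, it also penalizes near-miss boundaries that Pk can let pass.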
spellingShingle Mai Alammar
Khalil El Hindi
Hend Al-Khalifa
English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
Computation
text chunking
Arabic text chunking
semantic chunking
siamese network
BERT
semantic textual similarity
title English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_full English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_fullStr English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_full_unstemmed English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_short English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
title_sort english arabic hybrid semantic text chunking based on fine tuning bert
topic text chunking
Arabic text chunking
semantic chunking
siamese network
BERT
semantic textual similarity
url https://www.mdpi.com/2079-3197/13/6/151
work_keys_str_mv AT maialammar englisharabichybridsemantictextchunkingbasedonfinetuningbert
AT khalilelhindi englisharabichybridsemantictextchunkingbasedonfinetuningbert
AT hendalkhalifa englisharabichybridsemantictextchunkingbasedonfinetuningbert