English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines the effectiveness of unsupervised semantic chunking based on similarities between sentence embeddings with pre-trained language models (PLMs), in particular BERT, fine-tuned on the semantic textual similarity (STS) task, to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; we therefore created AraWiki50k, a dataset inspired by an existing English one, to evaluate our proposed method. Our experiments showed that exploiting BERT fine-tuned on STS improves on unsupervised semantic chunking by an average of 7.4 points in the Pk metric and 11.19 points in the WindowDiff metric across four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset.
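The abstract's first step, unsupervised chunking, places a boundary wherever the similarity between neighbouring sentence embeddings drops. Below is a minimal sketch in Python, assuming a SentenceTransformer encoder and a fixed cosine threshold; both the model name and the threshold are placeholders, since the record does not specify the paper's exact encoder or boundary rule.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_by_similarity(sentences, model, threshold=0.5):
    """Start a new chunk whenever the cosine similarity between
    consecutive sentence embeddings falls below `threshold`."""
    emb = model.encode(sentences)                     # (n_sentences, dim)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = emb[i - 1], emb[i]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim < threshold:                           # semantic drop: close the chunk
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks

# Placeholder encoder; the paper's method would substitute its fine-tuned BERT here.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The match ended in a draw.",
             "Both teams struggled to score.",
             "Inflation rose sharply last quarter.",
             "Central banks responded with rate hikes."]
print(chunk_by_similarity(sentences, model, threshold=0.4))
```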
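The second step fine-tunes BERT on the STS task; the record's keywords mention a siamese network, which corresponds to the sentence-transformers training recipe sketched below. The checkpoint name and the two training pairs are illustrative assumptions standing in for a real STS dataset (gold scores rescaled to [0, 1]).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap a BERT checkpoint (placeholder) with mean pooling to get sentence embeddings.
word_emb = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Hypothetical STS-style pairs; labels are similarity scores in [0, 1].
train_examples = [
    InputExample(texts=["A man is playing a guitar.",
                        "Someone plays an instrument."], label=0.8),
    InputExample(texts=["A man is playing a guitar.",
                        "A chef is cooking pasta."], label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Siamese objective: make embedding cosine similarity match the STS scores.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)],
          epochs=1, warmup_steps=10)
```

The fine-tuned encoder can then replace the placeholder model in `chunk_by_similarity` above.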
| Main Authors: | Mai Alammar, Khalil El Hindi, Hend Al-Khalifa |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Computation |
| Subjects: | text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity |
| Online Access: | https://www.mdpi.com/2079-3197/13/6/151 |
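The abstract reports gains in Pk and WindowDiff, the standard text-segmentation error metrics (lower is better). The sketch below follows their usual definitions (Pk from Beeferman et al., WindowDiff from Pevzner and Hearst); window-handling conventions vary slightly across implementations, so treat this as illustrative rather than reference code.

```python
# Segmentations are boundary-indicator lists: boundaries[i] == 1 means a
# segment break occurs immediately after sentence i.

def _segment_ids(boundaries):
    """Map boundary indicators to a segment id per sentence."""
    ids, cur = [], 0
    for b in boundaries:
        ids.append(cur)
        if b:
            cur += 1
    return ids

def pk(ref, hyp, k=None):
    """Rate at which sentences k apart are classified inconsistently
    (same segment vs. different segments) by the hypothesis."""
    if k is None:  # common default: half the mean reference segment length
        k = max(1, round(len(ref) / (sum(ref) + 1) / 2))
    r, h = _segment_ids(ref), _segment_ids(hyp)
    n = len(ref)
    errors = sum((r[i] == r[i + k]) != (h[i] == h[i + k]) for i in range(n - k))
    return errors / (n - k)

def windowdiff(ref, hyp, k):
    """Fraction of length-k windows in which the two segmentations
    disagree on the number of boundaries."""
    n = len(ref)
    errors = sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(n - k))
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 1, 0, 0]   # gold: breaks after sentences 2 and 5
hyp = [0, 1, 0, 0, 0, 1, 0, 0]   # predicted: first break off by one
print(pk(ref, hyp), windowdiff(ref, hyp, k=2))
```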
| _version_ | 1850156590490976256 |
|---|---|
| author | Mai Alammar; Khalil El Hindi; Hend Al-Khalifa |
| author_facet | Mai Alammar; Khalil El Hindi; Hend Al-Khalifa |
| author_sort | Mai Alammar |
| collection | DOAJ |
| description | Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines the effectiveness of unsupervised semantic chunking based on similarities between sentence embeddings with pre-trained language models (PLMs), in particular BERT, fine-tuned on the semantic textual similarity (STS) task, to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; we therefore created AraWiki50k, a dataset inspired by an existing English one, to evaluate our proposed method. Our experiments showed that exploiting BERT fine-tuned on STS improves on unsupervised semantic chunking by an average of 7.4 points in the Pk metric and 11.19 points in the WindowDiff metric across four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset. |
| format | Article |
| id | doaj-art-916511a37c7744ccbce19ea625e6a6db |
| institution | OA Journals |
| issn | 2079-3197 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Computation |
| spelling | doaj-art-916511a37c7744ccbce19ea625e6a6db; 2025-08-20T02:24:29Z; eng; MDPI AG; Computation; ISSN 2079-3197; 2025-06-01; vol. 13, iss. 6, art. 151; doi:10.3390/computation13060151; English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT; Mai Alammar, Khalil El Hindi, Hend Al-Khalifa (all: Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia); Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines the effectiveness of unsupervised semantic chunking based on similarities between sentence embeddings with pre-trained language models (PLMs), in particular BERT, fine-tuned on the semantic textual similarity (STS) task, to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; we therefore created AraWiki50k, a dataset inspired by an existing English one, to evaluate our proposed method. Our experiments showed that exploiting BERT fine-tuned on STS improves on unsupervised semantic chunking by an average of 7.4 points in the Pk metric and 11.19 points in the WindowDiff metric across four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset. https://www.mdpi.com/2079-3197/13/6/151; text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity |
| spellingShingle | Mai Alammar; Khalil El Hindi; Hend Al-Khalifa; English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT; Computation; text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity |
| title | English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT |
| title_full | English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT |
| title_fullStr | English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT |
| title_full_unstemmed | English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT |
| title_short | English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT |
| title_sort | english arabic hybrid semantic text chunking based on fine tuning bert |
| topic | text chunking; Arabic text chunking; semantic chunking; siamese network; BERT; semantic textual similarity |
| url | https://www.mdpi.com/2079-3197/13/6/151 |
| work_keys_str_mv | AT maialammar englisharabichybridsemantictextchunkingbasedonfinetuningbert AT khalilelhindi englisharabichybridsemantictextchunkingbasedonfinetuningbert AT hendalkhalifa englisharabichybridsemantictextchunkingbasedonfinetuningbert |