English-Arabic Hybrid Semantic Text Chunking Based on Fine-Tuning BERT
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Computation |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2079-3197/13/6/151 |
| Summary: | Semantic text chunking refers to segmenting text into semantically coherent chunks, i.e., into sets of statements that are semantically related. Semantic chunking is an essential pre-processing step in various NLP tasks, e.g., document summarization, sentiment analysis, and question answering. In this paper, we propose a hybrid, two-step semantic text chunking method that combines unsupervised semantic text chunking, based on the similarities between sentence embeddings, with pre-trained language models (PLMs), specifically BERT fine-tuned on the semantic textual similarity (STS) task, to provide flexible and effective semantic text chunking. We evaluated the proposed method on English and Arabic. To the best of our knowledge, no Arabic dataset exists for assessing semantic text chunking at this level; therefore, inspired by an existing English dataset, we created AraWiki50k to evaluate our proposed text chunking method. Our experiments showed that exploiting BERT fine-tuned on STS improves over unsupervised semantic chunking by an average of 7.4 in the Pk metric and 11.19 in the WindowDiff metric on four English evaluation datasets, and by 0.12 in Pk and 2.29 in WindowDiff on the Arabic dataset. |
| ISSN: | 2079-3197 |
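The unsupervised step summarized above places chunk boundaries where the similarity between adjacent sentence embeddings drops. A minimal sketch of that idea follows; the function name, the threshold value, and the toy vectors are illustrative assumptions, and in the paper's setting the embeddings would come from the fine-tuned BERT model rather than being hand-crafted:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk_by_similarity(embeddings, threshold=0.5):
    """Group consecutive sentence indices into chunks, starting a new
    chunk whenever adjacent-sentence similarity falls below threshold.

    embeddings: list of 1-D numpy arrays, one per sentence, in order.
    Returns a list of chunks, each a list of sentence indices.
    """
    chunks, current = [], [0]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(current)   # similarity dropped: close the chunk
            current = [i]
        else:
            current.append(i)        # still coherent: extend the chunk
    chunks.append(current)
    return chunks

# Toy example: sentences 0-1 point one way, sentences 2-3 another,
# so a boundary appears between indices 1 and 2.
embs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
        np.array([0.0, 1.0]), np.array([0.1, 0.9])]
print(chunk_by_similarity(embs))  # → [[0, 1], [2, 3]]
```

A fixed threshold is only one boundary criterion; adaptive choices (e.g., a boundary at local similarity minima) are common variants of the same unsupervised scheme.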