Text Alignment in the Service of Text Reuse Detection
| Main Authors: | Hadar Miller, Tsvi Kuflik, Moshe Lavee |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Applied Sciences |
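| Subjects: | text alignment; text reuse detection; natural language processing; word embeddings; Smith–Waterman algorithm; ancient languages |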
| Online Access: | https://www.mdpi.com/2076-3417/15/6/3395 |
| _version_ | 1849342858230759424 |
|---|---|
| author | Hadar Miller; Tsvi Kuflik; Moshe Lavee |
| collection | DOAJ |
| description | This study introduces a novel approach to text alignment tailored for ancient languages, with a focus on Hebrew and Aramaic, aimed at enhancing text reuse detection. Unlike previous methods, our approach integrates multiple NLP components into a specialized comparison pipeline, which is then incorporated into the Smith–Waterman algorithm. This integration enables improved alignment accuracy, particularly for historical texts characterized by fluctuations, orthographic changes, transcription variations, and word transpositions. Our key contributions include (1) a refined distance function that integrates fastText embeddings, allowing robust handling of out-of-vocabulary words; (2) a typological correction mechanism that can be integrated into automatic transcription pipelines to enhance text normalization; and (3) an evaluation of historical Hebrew texts, demonstrating an 11% improvement in the F1 score over existing approaches. These findings underscore the importance of computational methodologies in digital humanities and lay the groundwork for future multilingual extensions. |
| format | Article |
| id | doaj-art-a4b3b807379f40a980f652ca51f74e50 |
| institution | Kabale University |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| doi | 10.3390/app15063395 |
| volume, issue, article number | 15, 6, 3395 |
| affiliation | Hadar Miller: Department of Information Systems, University of Haifa, 99 Aba Hushi St., Haifa 3498838, Israel; Tsvi Kuflik: Department of Information Systems, University of Haifa, 99 Aba Hushi St., Haifa 3498838, Israel; Moshe Lavee: Department of Jewish History, University of Haifa, 99 Aba Hushi St., Haifa 3498838, Israel |
| title | Text Alignment in the Service of Text Reuse Detection |
| topic | text alignment; text reuse detection; natural language processing; word embeddings; Smith–Waterman algorithm; ancient languages |
| url | https://www.mdpi.com/2076-3417/15/6/3395 |
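
The description in this record says that a word-level comparison function is incorporated into the Smith–Waterman algorithm. The following is a minimal sketch of that general idea, not the authors' implementation. The names `trigram_similarity` and `smith_waterman` are hypothetical, and the character-trigram Jaccard overlap is only a stand-in assumption for the fastText-based distance function the record mentions; it is used here because, like subword embeddings, it still gives a usable score for spelling variants and unseen words. The English sample words are also illustrative.

```python
from typing import Callable, List, Tuple


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams, in [0, 1].

    Assumption: a simple stand-in for an embedding-based similarity.
    """
    def grams(word: str) -> set:
        padded = f"<{word}>"  # boundary markers, so prefixes and suffixes count
        return {padded[i:i + 3] for i in range(max(1, len(padded) - 2))}

    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def smith_waterman(
    s: List[str],
    t: List[str],
    sim: Callable[[str, str], float] = trigram_similarity,
    gap: float = 1.0,
) -> Tuple[float, List[Tuple[str, str]]]:
    """Local alignment of two word sequences with a pluggable similarity.

    Returns the best local score and the aligned word pairs.
    """
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            # Map similarity in [0, 1] to a substitution score in [-1, 1].
            score = 2 * sim(s[i - 1], t[j - 1]) - 1
            H[i][j] = max(
                0.0,                      # start a new local alignment
                H[i - 1][j - 1] + score,  # align s[i-1] with t[j-1]
                H[i - 1][j] - gap,        # skip a word of s
                H[i][j - 1] - gap,        # skip a word of t
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    pairs: List[Tuple[str, str]] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = 2 * sim(s[i - 1], t[j - 1]) - 1
        if H[i][j] == H[i - 1][j - 1] + score:
            pairs.append((s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    source = "in the beginning".split()
    variant = "in the begining".split()  # orthographic variant of the last word
    print(smith_waterman(source, variant))
```

Because the similarity function is passed in as a parameter, a different word-level measure (for example, cosine similarity over word vectors) could be swapped in without touching the alignment loop.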