Text Alignment in the Service of Text Reuse Detection

Bibliographic Details
Main Authors: Hadar Miller, Tsvi Kuflik, Moshe Lavee
Format: Article
Language: English
Published: MDPI AG 2025-03-01
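Series: Applied Sciences
Subjects: text alignment; text reuse detection; natural language processing; word embeddings; Smith–Waterman algorithm; ancient languages
Online Access: https://www.mdpi.com/2076-3417/15/6/3395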
author Hadar Miller
Tsvi Kuflik
Moshe Lavee
collection DOAJ
description This study introduces a novel approach to text alignment tailored for ancient languages, with a focus on Hebrew and Aramaic, aimed at enhancing text reuse detection. Unlike previous methods, our approach integrates multiple NLP components into a specialized comparison pipeline, which is then incorporated into the Smith–Waterman algorithm. This integration enables improved alignment accuracy, particularly for historical texts characterized by fluctuations, orthographic changes, transcription variations, and word transpositions. Our key contributions include (1) a refined distance function that integrates fastText embeddings, allowing robust handling of out-of-vocabulary words; (2) a typological correction mechanism that can be integrated into automatic transcription pipelines to enhance text normalization; and (3) an evaluation of historical Hebrew texts, demonstrating an 11% improvement in the F1 score over existing approaches. These findings underscore the importance of computational methodologies in digital humanities and lay the groundwork for future multilingual extensions.
format Article
id doaj-art-a4b3b807379f40a980f652ca51f74e50
institution Kabale University
issn 2076-3417
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-a4b3b807379f40a980f652ca51f74e50
2025-08-20T03:43:14Z
eng
MDPI AG
Applied Sciences
2076-3417
2025-03-01
Vol. 15, No. 6, Article 3395
10.3390/app15063395
Text Alignment in the Service of Text Reuse Detection
Hadar Miller (Department of Information Systems, University of Haifa, 99 Aba Hushi St., Haifa 3498838, Israel)
Tsvi Kuflik (Department of Information Systems, University of Haifa, 99 Aba Hushi St., Haifa 3498838, Israel)
Moshe Lavee (Department of Jewish History, University of Haifa, 99 Aba Hushi St., Haifa 3498838, Israel)
https://www.mdpi.com/2076-3417/15/6/3395
text alignment; text reuse detection; natural language processing; word embeddings; Smith–Waterman algorithm; ancient languages
title Text Alignment in the Service of Text Reuse Detection
topic text alignment
text reuse detection
natural language processing
word embeddings
Smith–Waterman algorithm
ancient languages
url https://www.mdpi.com/2076-3417/15/6/3395
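
Illustrative note: the description in this record says that a word-comparison pipeline is incorporated into the Smith–Waterman algorithm. The sketch below is not the authors' implementation; it only shows the general idea of word-level local alignment with a pluggable similarity function. The names smith_waterman and char_similarity, the 0.7 threshold, and the scoring constants are all hypothetical, and a character-based ratio from Python's difflib stands in for the fastText-based distance mentioned in the description.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def char_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; an embedding-based measure could be swapped in."""
    return SequenceMatcher(None, a, b).ratio()


def smith_waterman(
    left: List[str],
    right: List[str],
    sim: Callable[[str, str], float] = char_similarity,
    threshold: float = 0.7,  # hypothetical: similarity at or above this counts as a match
    mismatch: float = -1.0,  # hypothetical penalty for aligning dissimilar words
    gap: float = -1.0,       # hypothetical penalty for skipping a word
) -> Tuple[float, List[Tuple[str, str]]]:
    """Return the best local alignment score and the aligned word pairs."""
    rows, cols = len(left) + 1, len(right) + 1
    score = [[0.0] * cols for _ in range(rows)]
    sub = [[0.0] * cols for _ in range(rows)]  # cached substitution scores
    best, best_pos = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = sim(left[i - 1], right[j - 1])
            # Similar words earn a reward proportional to their similarity.
            sub[i][j] = 2.0 * s if s >= threshold else mismatch
            score[i][j] = max(
                0.0,                                # start a new local alignment
                score[i - 1][j - 1] + sub[i][j],    # align the two words
                score[i - 1][j] + gap,              # skip a word on the left
                score[i][j - 1] + gap,              # skip a word on the right
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts.
    pairs: List[Tuple[str, str]] = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub[i][j]:
            pairs.append((left[i - 1], right[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((left[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", right[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    a = "in the beginning god created the heavens".split()
    b = "and in the begining god made the heavens and".split()
    total, aligned = smith_waterman(a, b)
    for pair in aligned:
        print(pair)
```

Because the similarity function is passed in as a parameter, a spelling variant such as "begining" can still be treated as a match for "beginning", which is the kind of orthographic tolerance the description attributes to the refined distance function.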