A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore
| Main Authors: | |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Open Journal of the Computer Society |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10964149/ |
| Summary: | <sc><bold>Bleurt</bold></sc> is a recently introduced metric that employs <sc>Bert</sc>, a potent pre-trained language model, to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While traditional metrics like <sc>Bleu</sc> rely on lexical similarities, <sc>Bleurt</sc> leverages <sc>Bert</sc>’s semantic and syntactic capabilities to provide a more robust evaluation through complex text representations. However, studies have shown that <sc>Bert</sc>, despite its impressive performance in natural language processing tasks, can sometimes deviate from human judgment, particularly in specific syntactic and semantic scenarios. Through systematic experimental analysis at the word level, including the categorization of errors such as lexical mismatches, untranslated terms, and structural inconsistencies, we investigate how <sc>Bleurt</sc> handles various translation challenges. Our study addresses three central questions: What are the strengths and weaknesses of <sc>Bleurt</sc>, how do they align with <sc>Bert</sc>’s known limitations, and how does it compare with <sc>BERTScore</sc>, a similar automatic neural metric for machine translation? Using manually annotated datasets that emphasize different error types and linguistic phenomena, we find that <sc>Bleurt</sc> excels at identifying nuanced differences between sentences with high overlap, an area where <sc>BERTScore</sc> shows limitations. Our systematic experiments provide insights for the effective application of both metrics in machine translation evaluation. |
|---|---|
| ISSN: | 2644-1268 |
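To make the contrast in the summary concrete, the sketch below compares BLEU's clipped n-gram precision (purely lexical overlap) with the greedy embedding-matching idea behind BERTScore-style metrics. This is a minimal illustration, not either metric's reference implementation: the 2-d "embeddings" are invented toy values standing in for real BERT vectors, and the function names are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    """Clipped n-gram precision, the lexical core of BLEU."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def simple_bleu(cand, ref, max_n=2):
    """Geometric mean of n-gram precisions with a brevity penalty."""
    precs = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precs) == 0.0:
        return 0.0
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precs) / max_n)

# Hypothetical 2-d "embeddings": synonyms point in similar directions.
EMB = {
    "cat": (1.0, 0.0), "feline": (0.98, 0.2),
    "mat": (0.0, 1.0), "rug": (0.2, 0.98),
    "sat": (0.7, 0.7), "rested": (0.68, 0.73),
}

def cosine(u, v):
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))

def greedy_match_score(cand, ref):
    """BERTScore-style precision: each candidate token is greedily
    matched to its most cosine-similar reference token."""
    sims = [max(cosine(EMB[c], EMB[r]) for r in ref) for c in cand if c in EMB]
    return sum(sims) / len(sims) if sims else 0.0

ref = "cat sat mat".split()
hyp = "feline rested rug".split()
print(round(simple_bleu(hyp, ref), 3))         # 0.0: no lexical overlap at all
print(round(greedy_match_score(hyp, ref), 3))  # high: synonyms are matched
```

The paraphrase "feline rested rug" shares no surface tokens with "cat sat mat", so the BLEU-style score collapses to zero, while the embedding-based score stays high because each token finds a near-synonym in the reference. This is exactly the gap between lexical metrics like BLEU and neural metrics like BLEURT and BERTScore that the article investigates.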