A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore

BLEURT is a recently introduced metric that employs BERT, a potent pre-trained language model, to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While tradition...

Bibliographic Details
Main Authors:	Aniruddha Mukherjee, Vikas Hassija, Vinay Chamola, Karunesh Kumar Gupta
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Open Journal of the Computer Society
Subjects:	Natural language processing; deep learning; machine learning; metrics; machine translation
Online Access:	https://ieeexplore.ieee.org/document/10964149/

_version_	1850255709005938688
author	Aniruddha Mukherjee, Vikas Hassija, Vinay Chamola, Karunesh Kumar Gupta
collection	DOAJ
description	BLEURT is a recently introduced metric that employs BERT, a potent pre-trained language model, to assess how well candidate translations compare to a reference translation in the context of machine translation outputs. While traditional metrics like BLEU rely on lexical similarities, BLEURT leverages BERT’s semantic and syntactic capabilities to provide more robust evaluation through complex text representations. However, studies have shown that BERT, despite its impressive performance in natural language processing tasks, can sometimes deviate from human judgment, particularly in specific syntactic and semantic scenarios. Through systematic experimental analysis at the word level, including categorization of errors such as lexical mismatches, untranslated terms, and structural inconsistencies, we investigate how BLEURT handles various translation challenges. Our study addresses three central questions: what are the strengths and weaknesses of BLEURT, how do they align with BERT’s known limitations, and how does BLEURT compare with a similar automatic neural metric for machine translation, BERTScore? Using manually annotated datasets that emphasize different error types and linguistic phenomena, we find that BLEURT excels at identifying nuanced differences between sentences with high overlap, an area where BERTScore shows limitations. Our systematic experiments provide insights for the effective application of both metrics in machine translation evaluation.
format	Article
id	doaj-art-917ee0ef027a41f1b6bee08d9e6efde9
institution	OA Journals
issn	2644-1268
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Open Journal of the Computer Society
spelling	doaj-art-917ee0ef027a41f1b6bee08d9e6efde9; 2025-08-20T01:56:48Z; eng; IEEE; IEEE Open Journal of the Computer Society; ISSN 2644-1268; 2025-01-01; vol. 6, pp. 658-668; DOI 10.1109/OJCS.2025.3560333; article no. 10964149; A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore; Aniruddha Mukherjee (https://orcid.org/0009-0002-1273-8558), School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT) Deemed to be University, Bhubaneswar, Odisha, India; Vikas Hassija (https://orcid.org/0009-0003-9023-1661), School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT) Deemed to be University, Bhubaneswar, Odisha, India; Vinay Chamola (https://orcid.org/0000-0002-6730-3060), Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science, Pilani, Pilani Campus, Vidya Vihar, Pilani, Rajasthan, India; Karunesh Kumar Gupta (https://orcid.org/0000-0002-0003-4601), Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science, Pilani, Pilani Campus, Vidya Vihar, Pilani, Rajasthan, India; https://ieeexplore.ieee.org/document/10964149/; Natural language processing; deep learning; machine learning; metrics; machine translation
title	A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore
topic	Natural language processing; deep learning; machine learning; metrics; machine translation
url	https://ieeexplore.ieee.org/document/10964149/


Similar Items

A Detailed Comparative Analysis of Automatic Neural Metrics for Machine Translation: BLEURT & BERTScore