A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Bibliographic Details
Main Authors:	Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing
Format:	Article
Language:	English
Published:	Wiley 2014-01-01
Series:	The Scientific World Journal
Online Access:	http://dx.doi.org/10.1155/2014/745485

_version_	1832554658437005312
author	Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing
author_facet	Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing
author_sort	Longyue Wang
collection	DOAJ
description	Data selection has been shown to make significantly more effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first is cosine tf-idf, which comes from information retrieval (IR). The second is a perplexity-based approach, which comes from language modeling. Both techniques have already been applied to SMT in the literature. Edit distance, however, is proposed for this task for the first time in this paper. After the individual models are investigated, a combination of all three techniques is proposed at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the constraint degree of the similarity measure is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods help to balance out-of-vocabulary (OOV) words and irrelevant data; and (iv) our method consistently boosts overall translation performance, which can ensure optimal quality of a real-life SMT system.
format	Article
id	doaj-art-b93500a94c6a4267ab98b11b7f8c88cb
institution	Kabale University
issn	2356-6140 1537-744X
language	English
publishDate	2014-01-01
publisher	Wiley
record_format	Article
series	The Scientific World Journal
spelling	doaj-art-b93500a94c6a4267ab98b11b7f8c88cb; 2025-02-03T05:50:56Z; eng; Wiley; The Scientific World Journal; 2356-6140; 1537-744X; 2014-01-01; 2014; 10.1155/2014/745485; 745485; A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation; Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing (all: Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau, China); http://dx.doi.org/10.1155/2014/745485
title	A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation
url	http://dx.doi.org/10.1155/2014/745485
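
The description field names three sentence selection criteria: cosine tf-idf, perplexity, and edit distance. The following is a minimal Python sketch of how each score could be computed for a candidate general-domain sentence. It is not the authors' implementation: the function names, the smoothed idf formula, the add-one-smoothed unigram language model (the record does not state the model order), and the word-level normalisation of edit distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf dict per tokenized document (assumed smoothed idf)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def unigram_perplexity(sentence, in_domain_counts):
    """Perplexity of a sentence under an add-one-smoothed unigram model.

    A unigram model is an assumption chosen for brevity; lower values mean
    the sentence looks more like the in-domain text.
    """
    if not sentence:
        return float("inf")
    total = sum(in_domain_counts.values())
    vocab_size = len(in_domain_counts) + 1  # one extra slot for unseen words
    log_prob = sum(
        math.log((in_domain_counts[w] + 1) / (total + vocab_size)) for w in sentence
    )
    return math.exp(-log_prob / len(sentence))


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical token lists."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


# Toy data (hypothetical, not from the Hong Kong law corpus).
in_domain = [s.split() for s in ["the court shall hear the appeal",
                                 "the ordinance applies to the court"]]
general = [s.split() for s in ["the court shall hear the case",
                               "we went to the beach yesterday"]]

vectors = tfidf_vectors(in_domain + general)
counts = Counter(w for s in in_domain for w in s)
for idx, cand in enumerate(general):
    print(" ".join(cand),
          round(cosine(vectors[0], vectors[len(in_domain) + idx]), 3),
          round(unigram_perplexity(cand, counts), 3),
          round(edit_similarity(in_domain[0], cand), 3))
```

In a selection pipeline of this kind, each general-domain sentence would be ranked by one of these scores against the in-domain data and the top-ranked portion kept for training. The three scores differ in how strictly they match word order and surface form, which is the sense in which the description speaks of the constraint degree of the similarity measure.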

Similar Items