Machine Reading Comprehension for the Tamil Language With Translated SQuAD

Bibliographic Details
Main Authors: Anton Vijeevaraj Ann Sinthusha, Eugene Y. A. Charles, Ruvan Weerasinghe
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Machine reading comprehension (MRC); natural language processing (NLP); multilingual language models; low-resourced languages (LRL); Tamil language processing; transformer models
Online Access:https://ieeexplore.ieee.org/document/10844269/
author Anton Vijeevaraj Ann Sinthusha
Eugene Y. A. Charles
Ruvan Weerasinghe
collection DOAJ
description Machine Reading Comprehension (MRC) is a challenging task in Natural Language Processing (NLP), crucial for automated customer support, enabling chatbots and virtual assistants to accurately understand and respond to queries. It also enhances question-answering systems, benefiting educational tools, search engines, and help desks. The introduction of attention-based transformer models has significantly boosted MRC performance, especially for well-resourced languages such as English. However, MRC for low-resourced languages (LRL) remains an ongoing research area. Although Large Language Models show exceptional NLP performance, they are less effective for LRL and are expensive to train and deploy. Consequently, simpler language models that are targeted at specific tasks remain viable for these languages. This research examines high-performing language models on the Tamil MRC task, detailing the preparation of a Tamil-translated and processed SQuAD dataset to establish a standard dataset for Tamil MRC. The study analyzes the performance of multilingual transformer models on the Tamil MRC task, including Multilingual DistilBERT, Multilingual BERT, XLM-RoBERTa, MuRIL, and RemBERT. Additionally, it explores improving these models' performance by fine-tuning them with English SQuAD, Tamil SQuAD, and a newly developed Tamil Short Story (TSS) dataset for MRC. Tamil's agglutinative nature, which expresses grammatical information through suffixation, results in a high degree of word inflexion. Given this characteristic, the BERT score was chosen as the evaluation metric for MRC performance. The analysis shows that the XLM-RoBERTa model outperformed the others for Tamil MRC, achieving a BERT score of 86.29% on the TSS MRC test set. This superior performance is attributed to the model's cross-lingual learning capability and the increased number of data records used for fine-tuning. The research underscores the necessity of language-specific fine-tuning of multilingual models to enhance NLP task performance for low-resourced languages.
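The description notes that BERTScore was chosen because Tamil's heavy inflexion makes exact-match metrics unreliable. As an illustration only (not the authors' code), the greedy soft-matching step at the core of BERTScore can be sketched over toy token embeddings; the real metric obtains these embeddings from a pretrained contextual model such as multilingual BERT:

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy-matching BERTScore over token embeddings.

    cand_emb, ref_emb: 2-D arrays (num_tokens x dims) of token embeddings.
    Returns (precision, recall, F1).
    """
    # L2-normalise rows so dot products equal cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                        # pairwise cosine-similarity matrix
    precision = sim.max(axis=1).mean()   # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()      # each reference token -> best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    cand = np.array([[0.9, 0.1], [0.2, 0.8]])
    ref = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(bertscore_f1(cand, ref))
```

Because each token is matched to its most similar counterpart rather than an exact string, a correctly inflected Tamil answer that differs from the reference only in suffixes can still score highly.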
format Article
id doaj-art-018127326f294f38980fc8447a7e45e6
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-018127326f294f38980fc8447a7e45e6
indexed 2025-01-25T00:00:42Z
doi 10.1109/ACCESS.2025.3530949
volume 13
pages 13312-13328
ieee_document 10844269
author_affiliation Anton Vijeevaraj Ann Sinthusha, University of Colombo School of Computing, Colombo, Sri Lanka
author_affiliation Eugene Y. A. Charles (https://orcid.org/0000-0002-0678-3486), Department of Computer Science, University of Jaffna, Jaffna, Sri Lanka
author_affiliation Ruvan Weerasinghe, University of Colombo School of Computing, Colombo, Sri Lanka
title Machine Reading Comprehension for the Tamil Language With Translated SQuAD
topic Machine reading comprehension (MRC)
natural language processing (NLP)
multilingual language models
low-resourced languages (LRL)
Tamil language processing
transformer models
url https://ieeexplore.ieee.org/document/10844269/