Machine Reading Comprehension for the Tamil Language With Translated SQuAD

Bibliographic Details
Main Authors: Anton Vijeevaraj Ann Sinthusha, Eugene Y. A. Charles, Ruvan Weerasinghe
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Machine reading comprehension (MRC); natural language processing (NLP); multilingual language models; low-resourced languages (LRL); Tamil language processing; transformer models
Online Access:https://ieeexplore.ieee.org/document/10844269/
author Anton Vijeevaraj Ann Sinthusha
Eugene Y. A. Charles
Ruvan Weerasinghe
collection DOAJ
description Machine Reading Comprehension (MRC) is a challenging task in Natural Language Processing (NLP), crucial for automated customer support, enabling chatbots and virtual assistants to accurately understand and respond to queries. It also enhances question-answering systems, benefiting educational tools, search engines, and help desks. The introduction of attention-based transformer models has significantly boosted MRC performance, especially for well-resourced languages such as English. However, MRC for low-resourced languages (LRL) remains an ongoing research area. Although Large Language Models show exceptional NLP performance, they are less effective for LRL and are expensive to train and deploy. Consequently, simpler language models that are targeted at specific tasks remain viable for these languages. This research examines high-performing language models on the Tamil MRC task, detailing the preparation of a Tamil-translated and processed SQuAD dataset to establish a standard dataset for Tamil MRC. The study analyzes the performance of multilingual transformer models on the Tamil MRC task, including Multilingual DistilBERT, Multilingual BERT, XLM-RoBERTa, MuRIL, and RemBERT. Additionally, it explores improving these models' performance by fine-tuning them with English SQuAD, Tamil SQuAD, and a newly developed Tamil Short Story (TSS) dataset for MRC. Tamil's agglutinative nature, which expresses grammatical information through suffixation, results in a high degree of word inflexion. Given this characteristic, the BERT score was chosen as the evaluation metric for MRC performance. The analysis shows that the XLM-RoBERTa model outperformed the others for Tamil MRC, achieving a BERT score of 86.29% on the TSS MRC test set. This superior performance is attributed to the model's cross-lingual learning capability and the increased number of data records used for fine-tuning. The research underscores the necessity of language-specific fine-tuning of multilingual models to enhance NLP task performance for low-resourced languages.
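The description notes that BERTScore was chosen because Tamil's heavy inflexion makes exact-match metrics unreliable. As an illustration only (not the authors' code), the greedy soft-matching step at the core of BERTScore can be sketched over toy token embeddings; the real metric obtains these embeddings from a pretrained contextual model such as multilingual BERT:

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Greedy-matching BERTScore over token embeddings.

    cand_emb, ref_emb: 2-D arrays (num_tokens x dims) of token embeddings.
    Returns (precision, recall, F1).
    """
    # L2-normalise rows so dot products equal cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                        # pairwise cosine-similarity matrix
    precision = sim.max(axis=1).mean()   # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()      # each reference token -> best candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    cand = np.array([[0.9, 0.1], [0.2, 0.8]])
    ref = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(bertscore_f1(cand, ref))
```

Because each token is matched to its most similar counterpart rather than an exact string, a correctly inflected Tamil answer that differs from the reference only in suffixes can still score highly.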
format Article
id doaj-art-018127326f294f38980fc8447a7e45e6
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-018127326f294f38980fc8447a7e45e6
indexed 2025-01-25T00:00:42Z
doi 10.1109/ACCESS.2025.3530949
volume 13
pages 13312-13328
ieee_document 10844269
author_affiliation Anton Vijeevaraj Ann Sinthusha, University of Colombo School of Computing, Colombo, Sri Lanka
author_affiliation Eugene Y. A. Charles (https://orcid.org/0000-0002-0678-3486), Department of Computer Science, University of Jaffna, Jaffna, Sri Lanka
author_affiliation Ruvan Weerasinghe, University of Colombo School of Computing, Colombo, Sri Lanka
title Machine Reading Comprehension for the Tamil Language With Translated SQuAD
topic Machine reading comprehension (MRC)
natural language processing (NLP)
multilingual language models
low-resourced languages (LRL)
Tamil language processing
transformer models
url https://ieeexplore.ieee.org/document/10844269/