Machine Reading Comprehension for the Tamil Language With Translated SQuAD
Machine Reading Comprehension (MRC) is a challenging task in Natural Language Processing (NLP), crucial for automated customer support, enabling chatbots and virtual assistants to accurately understand and respond to queries. It also enhances question-answering systems, benefiting educational tools,...
Main Authors: | Anton Vijeevaraj Ann Sinthusha; Eugene Y. A. Charles; Ruvan Weerasinghe
---|---
Format: | Article
Language: | English
Published: | IEEE, 2025-01-01
Series: | IEEE Access
Subjects: | Machine reading comprehension (MRC); natural language processing (NLP); multilingual language models; low-resourced languages (LRL); Tamil language processing; transformer models
Online Access: | https://ieeexplore.ieee.org/document/10844269/
_version_ | 1832586834642731008
---|---
author | Anton Vijeevaraj Ann Sinthusha; Eugene Y. A. Charles; Ruvan Weerasinghe
author_facet | Anton Vijeevaraj Ann Sinthusha; Eugene Y. A. Charles; Ruvan Weerasinghe
author_sort | Anton Vijeevaraj Ann Sinthusha |
collection | DOAJ |
description | Machine Reading Comprehension (MRC) is a challenging task in Natural Language Processing (NLP), crucial for automated customer support, enabling chatbots and virtual assistants to accurately understand and respond to queries. It also enhances question-answering systems, benefiting educational tools, search engines, and help desks. The introduction of attention-based transformer models has significantly boosted MRC performance, especially for well-resourced languages such as English. However, MRC for low-resourced languages (LRL) remains an ongoing research area. Although Large Language Models show exceptional NLP performance, they are less effective for LRL and are expensive to train and deploy. Consequently, simpler language models that are targeted at specific tasks remain viable for these languages. This research examines high-performing language models on the Tamil MRC task, detailing the preparation of a Tamil-translated and processed SQuAD dataset to establish a standard dataset for Tamil MRC. The study analyzes the performance of multilingual transformer models on the Tamil MRC task, including Multilingual DistilBERT, Multilingual BERT, XLM-RoBERTa, MuRIL, and RemBERT. Additionally, it explores improving these models’ performance by fine-tuning them with English SQuAD, Tamil SQuAD, and a newly developed Tamil Short Story (TSS) dataset for MRC. Tamil’s agglutinative nature, which expresses grammatical information through suffixation, results in a high degree of word inflexion. Given this characteristic, BERTScore was chosen as the evaluation metric for MRC performance. The analysis shows that the XLM-RoBERTa model outperformed the others for Tamil MRC, achieving a BERTScore of 86.29% on the TSS MRC test set. This superior performance is attributed to the model’s cross-lingual learning capability and the increased number of data records used for fine-tuning. The research underscores the necessity of language-specific fine-tuning of multilingual models to enhance NLP task performance for low-resourced languages. (An illustrative fine-tuning and evaluation sketch follows this record.)
format | Article |
id | doaj-art-018127326f294f38980fc8447a7e45e6 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-018127326f294f38980fc8447a7e45e6; record updated 2025-01-25T00:00:42Z; English; IEEE; IEEE Access; ISSN 2169-3536; published 2025-01-01; vol. 13, pp. 13312–13328; DOI 10.1109/ACCESS.2025.3530949; IEEE article 10844269. Machine Reading Comprehension for the Tamil Language With Translated SQuAD. Anton Vijeevaraj Ann Sinthusha (University of Colombo School of Computing, Colombo, Sri Lanka); Eugene Y. A. Charles (Department of Computer Science, University of Jaffna, Jaffna, Sri Lanka; ORCID: https://orcid.org/0000-0002-0678-3486); Ruvan Weerasinghe (University of Colombo School of Computing, Colombo, Sri Lanka). Abstract, online-access URL, and subject keywords as in the description, url, and topic fields of this record.
spellingShingle | Anton Vijeevaraj Ann Sinthusha; Eugene Y. A. Charles; Ruvan Weerasinghe; Machine Reading Comprehension for the Tamil Language With Translated SQuAD; IEEE Access; Machine reading comprehension (MRC); natural language processing (NLP); multilingual language models; low-resourced languages (LRL); Tamil language processing; transformer models
title | Machine Reading Comprehension for the Tamil Language With Translated SQuAD |
title_full | Machine Reading Comprehension for the Tamil Language With Translated SQuAD |
title_fullStr | Machine Reading Comprehension for the Tamil Language With Translated SQuAD |
title_full_unstemmed | Machine Reading Comprehension for the Tamil Language With Translated SQuAD |
title_short | Machine Reading Comprehension for the Tamil Language With Translated SQuAD |
title_sort | machine reading comprehension for the tamil language with translated squad |
topic | Machine reading comprehension (MRC); natural language processing (NLP); multilingual language models; low-resourced languages (LRL); Tamil language processing; transformer models
url | https://ieeexplore.ieee.org/document/10844269/ |
work_keys_str_mv | AT antonvijeevarajannsinthusha machinereadingcomprehensionforthetamillanguagewithtranslatedsquad AT eugeneyacharles machinereadingcomprehensionforthetamillanguagewithtranslatedsquad AT ruvanweerasinghe machinereadingcomprehensionforthetamillanguagewithtranslatedsquad |
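
The description field above outlines the paper’s pipeline: fine-tune multilingual transformers such as XLM-RoBERTa on SQuAD-format Tamil data, and evaluate with BERTScore because Tamil’s agglutinative suffixation makes exact-match metrics penalize inflected variants of a correct answer. The sketch below is a minimal illustration of that pipeline, assuming the Hugging Face `transformers`/`datasets` libraries and the `bert-score` package; the file `tamil_squad.json`, the hyperparameters, the output path, and the example answer strings are hypothetical placeholders, not the authors’ artifacts.

```python
# Illustrative sketch, not the authors' code: fine-tune XLM-RoBERTa for
# extractive QA on a SQuAD-format Tamil file, then compare answer strings
# with BERTScore. "tamil_squad.json", the hyperparameters, and the output
# path are placeholder assumptions.
from bert_score import score
from datasets import load_dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("xlm-roberta-base")

# Assumes flat, answerable SQuAD-v1-style records:
# {"question": ..., "context": ...,
#  "answers": {"text": [...], "answer_start": [...]}}
raw = load_dataset("json", data_files={"train": "tamil_squad.json"})["train"]

def preprocess(batch):
    """Tokenize (question, context) pairs and map each gold answer's
    character span onto token start/end positions for the QA head."""
    enc = tokenizer(batch["question"], batch["context"],
                    truncation="only_second", max_length=384,
                    padding="max_length", return_offsets_mapping=True)
    starts, ends = [], []
    for i, offsets in enumerate(enc["offset_mapping"]):
        ans = batch["answers"][i]
        s_char = ans["answer_start"][0]
        e_char = s_char + len(ans["text"][0])
        seq_ids = enc.sequence_ids(i)
        c_start = seq_ids.index(1)                         # first context token
        c_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)  # last context token
        if offsets[c_start][0] > s_char or offsets[c_end][1] < e_char:
            starts.append(0)  # answer truncated away; point at <s>
            ends.append(0)
        else:
            j = c_start
            while j <= c_end and offsets[j][0] <= s_char:
                j += 1
            starts.append(j - 1)
            j = c_end
            while j >= c_start and offsets[j][1] >= e_char:
                j -= 1
            ends.append(j + 1)
    enc["start_positions"] = starts
    enc["end_positions"] = ends
    enc.pop("offset_mapping")
    return enc

train = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(output_dir="xlmr-tamil-qa",
                         per_device_train_batch_size=8,
                         learning_rate=3e-5,
                         num_train_epochs=2)
Trainer(model=model, args=args, train_dataset=train).train()

# Why BERTScore: the prediction below differs from the gold answer only
# by a locative case suffix, so exact match scores 0, while the
# embedding-based metric still credits the near-identical meaning.
preds = ["இலங்கை"]        # hypothetical model answer: "Sri Lanka"
golds = ["இலங்கையில்"]    # gold answer: "in Sri Lanka" (suffix -il)
P, R, F1 = score(preds, golds, lang="ta")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```

The closing lines make concrete the motivation stated in the abstract: under heavy word inflexion, an answer that differs from the gold only by a suffix gets zero exact-match credit, whereas an embedding-based metric such as BERTScore still rewards it.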