Keyphrase generation for the Russian-language scientific texts using mT5

Bibliographic Details
Main Authors:	Anna V. Glazkova, Dmitry A. Morozov, Marina S. Vorobeva, Andrey Stupnikov
Format:	Article
Language:	English
Published:	Yaroslavl State University 2023-12-01
Series:	Моделирование и анализ информационных систем
Subjects:	automatic text summarization; selecting keyphrases; mt5
Online Access:	https://www.mais-journal.ru/jour/article/view/1829

author	Anna V. Glazkova Dmitry A. Morozov Marina S. Vorobeva Andrey Stupnikov
collection	DOAJ
description	In this work, we applied the multilingual text-to-text transformer (mT5) to the task of keyphrase generation for Russian scientific texts using the Keyphrases CS&Math Russian corpus. Automatic keyphrase selection is a relevant natural language processing task, since keyphrases help readers find an article easily and facilitate the systematization of scientific texts. In this paper, keyphrase selection is treated as a text summarization task. The mT5 model was fine-tuned on the abstracts of Russian research papers. We used abstracts as the model input and comma-separated lists of keyphrases as the output. The results of mT5 were compared with several baselines, including TopicRank, YAKE!, RuTermExtract, and KeyBERT. The results are reported in terms of the full-match F1-score, ROUGE-1, and BERTScore. The best results on the test set were obtained by mT5 and RuTermExtract. mT5 achieves the highest F1-score (11.24%), exceeding RuTermExtract by 0.22%. RuTermExtract shows the highest ROUGE-1 score (15.12%). According to BERTScore, the best results were also obtained by these two methods: mT5 scored 76.89% (BERTScore using mBERT) and RuTermExtract scored 75.8% (BERTScore using ruSciBERT). Moreover, we evaluated the ability of mT5 to predict keyphrases that are absent from the source text. The main limitations of the proposed approach are the need for a training sample for fine-tuning and the probably limited suitability of the fine-tuned model in cross-domain settings. The advantages of keyphrase generation using pre-trained mT5 are that there is no need to define the number and length of keyphrases or to normalize the produced keyphrases, which is important for inflected languages, and that the model can generate keyphrases that are not explicitly present in the text.
format	Article
id	doaj-art-730b4b81c6fe4569a75c96c0d809579d
institution	DOAJ
issn	1818-1015 2313-5417
language	English
publishDate	2023-12-01
publisher	Yaroslavl State University
record_format	Article
series	Моделирование и анализ информационных систем
spelling	doaj-art-730b4b81c6fe4569a75c96c0d809579d | 2025-08-20T03:22:04Z | eng | Yaroslavl State University | Моделирование и анализ информационных систем | 1818-1015 | 2313-5417 | 2023-12-01 | 30 | 4 | 418 | 428 | 10.18255/1818-1015-2023-4-418-428 | 1399 | Keyphrase generation for the Russian-language scientific texts using mT5 | Anna V. Glazkova 0 | Dmitry A. Morozov 1 | Marina S. Vorobeva 2 | Andrey Stupnikov 3 | University of Tyumen; Institute for Information Transmission Problems (Kharkevich Institute) | Novosibirsk National Research State University; Institute for Information Transmission Problems (Kharkevich Institute) | University of Tyumen | University of Tyumen | https://www.mais-journal.ru/jour/article/view/1829 | automatic text summarization | selecting keyphrases | mt5
title	Keyphrase generation for the Russian-language scientific texts using mT5
topic	automatic text summarization; selecting keyphrases; mt5
url	https://www.mais-journal.ru/jour/article/view/1829
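
The abstract of this record describes a setup in which an abstract is the model input, a comma-separated list of keyphrases is the output, and quality is reported as a full-match F1-score, with a separate look at keyphrases absent from the source text. The following is a minimal Python sketch of those data-handling steps only, not the authors' code. The function names, the lowercase and whitespace normalization, the plain substring test for "absent" keyphrases, and the English toy example are all assumptions made for illustration; the record does not state how the paper normalizes phrases or averages scores.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def to_training_pair(abstract, keyphrases):
    """Build one (input, target) pair: abstract in, comma-separated keyphrases out."""
    return abstract.strip(), ", ".join(k.strip() for k in keyphrases)


def parse_keyphrases(generated):
    """Split a generated comma-separated string into a deduplicated list."""
    seen, result = set(), []
    for part in generated.split(","):
        phrase = normalize(part)
        if phrase and phrase not in seen:
            seen.add(phrase)
            result.append(phrase)
    return result


def full_match_f1(predicted, gold):
    """F1 over exact matches between predicted and gold keyphrase sets."""
    pred_set = {normalize(p) for p in predicted}
    gold_set = {normalize(g) for g in gold}
    matches = len(pred_set & gold_set)
    if matches == 0:
        return 0.0
    precision = matches / len(pred_set)
    recall = matches / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def absent_keyphrases(keyphrases, source_text):
    """Keyphrases that do not occur verbatim in the source text.

    Assumption: a plain substring test. For an inflected language such as
    Russian, lemmatizing both sides first would likely be needed.
    """
    source = normalize(source_text)
    return [k for k in keyphrases if normalize(k) not in source]


# Hypothetical toy data, not taken from the corpus.
abstract = ("We fine-tune a multilingual transformer to generate "
            "keyphrases for scientific abstracts.")
gold = ["keyphrase generation", "multilingual transformer", "text summarization"]
generated = "multilingual transformer, keyphrase generation, neural networks"

model_input, target = to_training_pair(abstract, gold)
predicted = parse_keyphrases(generated)
score = full_match_f1(predicted, gold)
absent = absent_keyphrases(gold, abstract)
```

In this toy case two of the three predicted phrases match the gold list exactly, so precision and recall are both 2/3. Exact matching is strict: a prediction that differs from the gold phrase only by word form counts as a miss, which is one reason the abstract also reports ROUGE-1 and BERTScore.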

Similar Items