Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm
The article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features built on lexico-grammatical features. The experiments were ca...
| Main Author: | Ksenia Vladimirovna Lagutina |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Yaroslavl State University, 2022-12-01 |
| Series: | Моделирование и анализ информационных систем |
| Subjects: | stylometry; natural language processing; rhythm features; genres; text classification; BERT; ELMo |
| Online Access: | https://www.mais-journal.ru/jour/article/view/1750 |
| _version_ | 1850025685767159808 |
|---|---|
| author | Ksenia Vladimirovna Lagutina |
| collection | DOAJ |
| description | The article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features built on lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network VKontakte, and news from OpenCorpora. Visualization and statistical analysis of the rhythm features identified the genres most diverse in rhythm (novels and reviews) and the least diverse (scientific articles). These same genres were subsequently classified best by the rhythm features combined with an LSTM neural network classifier. Clustering and classifying texts by genre with ELMo and BERT embeddings separated one genre from another with few errors; the multiclass F-score reached 99%. The study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the rhythm feature set for genre classification. |
| format | Article |
| id | doaj-art-e16e980aa07b4ee3b6e024605849d21e |
| institution | DOAJ |
| issn | 1818-1015 2313-5417 |
| language | English |
| publishDate | 2022-12-01 |
| publisher | Yaroslavl State University |
| record_format | Article |
| series | Моделирование и анализ информационных систем |
| spelling | doaj-art-e16e980aa07b4ee3b6e024605849d21e; 2025-08-20T03:00:45Z; eng; Yaroslavl State University; Моделирование и анализ информационных систем; 1818-1015; 2313-5417; 2022-12-01; 29; 4; 334; 347; 10.18255/1818-1015-2022-4-334-347; 1355; Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm; Ksenia Vladimirovna Lagutina; 0; P. G. Demidov Yaroslavl State University; https://www.mais-journal.ru/jour/article/view/1750; stylometry; natural language processing; rhythm features; genres; text classification; bert; elmo |
| title | Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm |
| topic | stylometry; natural language processing; rhythm features; genres; text classification; BERT; ELMo |
| url | https://www.mais-journal.ru/jour/article/view/1750 |