Keywords, morpheme parsing and syntactic trees: features for text complexity assessment

Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for generating a feature vector for automatic text complexity assessment are...

Bibliographic Details
Main Authors: Dmitry A. Morozov, Ivan A. Smal, Timur A. Garipov, Anna V. Glazkova
Format: Article
Language: English
Published: Yaroslavl State University 2024-06-01
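Series: Моделирование и анализ информационных систем (Modeling and Analysis of Information Systems)
Subjects: text complexity; keyword generation; morpheme parsing generation; syntax trees
Online Access: https://www.mais-journal.ru/jour/article/view/1855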
collection DOAJ
description Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for generating a feature vector for automatic text complexity assessment are quite diverse. Early approaches relied on easily calculable quantities, such as the average sentence length or the average number of syllables per word. With the development of natural language processing algorithms, the space of usable features has been expanding. In this work, we examined three groups of features: 1) automatically generated keywords, 2) features derived from morphemic parsing of words, and 3) information about the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm was used to generate keywords, a convolutional neural network model was used to generate morphemic parses, and the Stanza model trained on the SynTagRus corpus was used to generate syntax trees. We compared the features using four different machine learning algorithms and four annotated Russian-language text corpora. The corpora differ in both domain and annotation paradigm, so the results reflect the real relationship between the features and text complexity more objectively. On average, keywords performed worse than topic markers obtained using latent Dirichlet allocation. In most situations, morphemic features turned out to be more effective than previously described methods for assessing the lexical complexity of a text: word frequency and the occurrence of word-formation patterns. In most cases, using an extensive set of syntactic features improved the quality of neural network models compared with the previously described feature set.
id doaj-art-d6daf9bb110a47beb0c56dcaf65ecc31
institution DOAJ
issn 1818-1015
2313-5417
author_affiliation Dmitry A. Morozov, Novosibirsk National Research State University
Ivan A. Smal, Novosibirsk National Research State University
Timur A. Garipov, Novosibirsk National Research State University
Anna V. Glazkova, University of Tyumen
volume 31
issue 2
pages 206-220
doi 10.18255/1818-1015-2024-2-206-220
topic text complexity
keyword generation
morpheme parsing generation
syntax trees
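
The abstract names the depth and branching of syntactic trees as one feature group. The following is a minimal Python sketch of how such quantities could be computed for a single sentence. It is not the authors' feature set: the function name `tree_features`, the exact definitions of depth and branching, and the input format are assumptions made for illustration. The input is a list of head indices in the CoNLL-U convention (1-based, with 0 for the root), which a dependency parser such as Stanza exposes through the `head` attribute of each word.

```python
from collections import defaultdict
from statistics import mean


def tree_features(heads):
    """Compute simple shape features of one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1,
    with 0 marking the root (the CoNLL-U convention).
    Hypothetical helper for illustration, not the paper's feature set.
    """
    children = defaultdict(list)
    roots = []
    for token_id, head in enumerate(heads, start=1):
        if head == 0:
            roots.append(token_id)
        else:
            children[head].append(token_id)
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")

    # Depth = number of tokens on the longest root-to-leaf path.
    depth = 0
    stack = [(roots[0], 1)]
    while stack:
        node, level = stack.pop()
        depth = max(depth, level)
        for child in children.get(node, ()):
            stack.append((child, level + 1))

    # Branching = number of direct dependents of each token that has any.
    fanouts = [len(kids) for kids in children.values()]
    return {
        "depth": depth,
        "max_branching": max(fanouts, default=0),
        "mean_branching": mean(fanouts) if fanouts else 0.0,
    }


# A flat three-token tree: tokens 1 and 3 both depend on the root token 2.
print(tree_features([2, 0, 2]))
# A chain: token 2 depends on token 1, token 3 depends on token 2.
print(tree_features([0, 1, 2]))
```

Per-sentence values like these would typically be aggregated over a whole text (for example as means or maxima) before being passed to a classifier; that aggregation step is likewise an assumption here, not a detail taken from the record.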