Keywords, morpheme parsing and syntactic trees: features for text complexity assessment

Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for building the feature vector used in automatic text complexity assessment are...

Bibliographic Details
Main Authors:	Dmitry A. Morozov, Ivan A. Smal, Timur A. Garipov, Anna V. Glazkova
Format:	Article
Language:	English
Published:	Yaroslavl State University 2024-06-01
Series:	Моделирование и анализ информационных систем
Subjects:	text complexity; keyword generation; morpheme parsing generation; syntax trees
Online Access:	https://www.mais-journal.ru/jour/article/view/1855

_version_	1849688186493599744
author	Dmitry A. Morozov, Ivan A. Smal, Timur A. Garipov, Anna V. Glazkova
collection	DOAJ
description	Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for building the feature vector used in automatic text complexity assessment are quite diverse. Early approaches relied on easily computed quantities, such as the average sentence length or the average number of syllables per word. As natural language processing algorithms have developed, the set of features in use has expanded. In this work, we examined three groups of features: 1) automatically generated keywords, 2) characteristics of the morphemic parsing of words, and 3) the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm was used to generate keywords, a convolutional neural network model to generate morphemic parses, and the Stanza model trained on the SynTagRus corpus to generate syntax trees. We compared the features using four machine learning algorithms and four annotated Russian-language text corpora. The corpora differ in both domain and markup paradigm, so the results more objectively reflect the real relationship between the features and text complexity. On average, keywords performed worse than topic markers obtained with latent Dirichlet allocation. In most situations, morphemic characteristics proved more effective than previously described methods for assessing the lexical complexity of a text: word frequency and the occurrence of word-formation patterns. In most cases, the extensive set of syntactic features improved the quality of neural network models compared with the previously described set.
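The description mentions two kinds of features that are easy to illustrate: the early surface measures (average sentence length, average syllables per word) and the depth and branching of syntactic trees. The following is a minimal sketch of both, not the authors' implementation. The function names, the regex-based sentence and word splitting, and the vowel-counting syllable heuristic for Russian are all assumptions made for illustration. The tree function takes a plain list of head indices in the CoNLL-U convention (1-based, 0 for the root), as a dependency parser would typically produce, and assumes the list describes a valid tree.

```python
import re
from collections import Counter

# Assumption: a Russian syllable is approximated by counting vowel letters.
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def surface_features(text):
    """Average sentence length (in words) and average syllables per word."""
    # Naive sentence split on terminal punctuation (illustrative only).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters only; digits and underscores are excluded.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_syllables": 0.0}
    syllables = sum(sum(ch in RUSSIAN_VOWELS for ch in w) for w in words)
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_syllables": syllables / len(words),
    }


def tree_features(heads):
    """Depth and maximum branching of one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    Depth is counted in nodes, so a single-token sentence has depth 1.
    The input is assumed to be a valid tree (no cycles).
    """
    if not heads:
        return {"depth": 0, "max_branching": 0}

    def depth(token):
        # Walk from the token up to the root, counting visited nodes.
        d = 1
        while heads[token - 1] != 0:
            token = heads[token - 1]
            d += 1
        return d

    # Number of direct dependents per head token.
    children = Counter(h for h in heads if h != 0)
    return {
        "depth": max(depth(t) for t in range(1, len(heads) + 1)),
        "max_branching": max(children.values(), default=0),
    }
```

For a corpus, such per-sentence values would usually be aggregated (for example, averaged) into a single feature vector per text before being passed to a classifier or regressor.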
format	Article
id	doaj-art-d6daf9bb110a47beb0c56dcaf65ecc31
institution	DOAJ
issn	1818-1015 2313-5417
language	English
publishDate	2024-06-01
publisher	Yaroslavl State University
record_format	Article
series	Моделирование и анализ информационных систем
spelling	doaj-art-d6daf9bb110a47beb0c56dcaf65ecc31; 2025-08-20T03:22:04Z; eng; Yaroslavl State University; Моделирование и анализ информационных систем; 1818-1015, 2313-5417; 2024-06-01; vol. 31, no. 2, pp. 206-220; 10.18255/1818-1015-2024-2-206-220; 1412; Keywords, morpheme parsing and syntactic trees: features for text complexity assessment; Dmitry A. Morozov (Novosibirsk National Research State University), Ivan A. Smal (Novosibirsk National Research State University), Timur A. Garipov (Novosibirsk National Research State University), Anna V. Glazkova (University of Tyumen); https://www.mais-journal.ru/jour/article/view/1855; text complexity; keyword generation; morpheme parsing generation; syntax trees
title	Keywords, morpheme parsing and syntactic trees: features for text complexity assessment
topic	text complexity; keyword generation; morpheme parsing generation; syntax trees
url	https://www.mais-journal.ru/jour/article/view/1855
