Keywords, morpheme parsing and syntactic trees: features for text complexity assessment
Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. The methods for generating a feature vector when automatically assessing the text complexity are...
| Main Authors: | Dmitry A. Morozov, Ivan A. Smal, Timur A. Garipov, Anna V. Glazkova |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Yaroslavl State University, 2024-06-01 |
| Series: | Моделирование и анализ информационных систем |
| Subjects: | text complexity; keyword generation; morpheme parsing generation; syntax trees |
| Online Access: | https://www.mais-journal.ru/jour/article/view/1855 |
| _version_ | 1849688186493599744 |
|---|---|
| author | Dmitry A. Morozov, Ivan A. Smal, Timur A. Garipov, Anna V. Glazkova |
| author_sort | Dmitry A. Morozov |
| collection | DOAJ |
| description | Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for generating a feature vector for automatic text complexity assessment are quite diverse. Early approaches relied on easily calculable quantities, such as the average sentence length or the average number of syllables per word. As natural language processing algorithms have developed, the space of usable features has expanded. In this work, we examined three groups of features: 1) automatically generated keywords, 2) characteristics of morphemic word parses, and 3) the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm was used to generate keywords, a convolutional neural network model to generate morphemic parses, and the Stanza model, trained on the SynTagRus corpus, to generate syntax trees. We compared the features using four machine learning algorithms and four annotated Russian-language text corpora. The corpora differ in both domain and markup paradigm, so the results reflect the real relationship between the features and text complexity more objectively. Keywords performed worse on average than topic markers obtained with latent Dirichlet allocation. In most situations, morphemic characteristics were more effective than previously described methods for assessing the lexical complexity of a text: word frequency and the occurrence of word-formation patterns. In most cases, an extensive set of syntactic features improved the quality of neural network models compared with the previously described set. |
| format | Article |
| id | doaj-art-d6daf9bb110a47beb0c56dcaf65ecc31 |
| institution | DOAJ |
| issn | 1818-1015 2313-5417 |
| language | English |
| publishDate | 2024-06-01 |
| publisher | Yaroslavl State University |
| record_format | Article |
| series | Моделирование и анализ информационных систем |
| spelling | doaj-art-d6daf9bb110a47beb0c56dcaf65ecc31; 2025-08-20T03:22:04Z; eng; Yaroslavl State University; Моделирование и анализ информационных систем; ISSN 1818-1015, 2313-5417; 2024-06-01; vol. 31, no. 2, pp. 206-220; DOI 10.18255/1818-1015-2024-2-206-220; 1412; Keywords, morpheme parsing and syntactic trees: features for text complexity assessment; Dmitry A. Morozov (Novosibirsk National Research State University), Ivan A. Smal (Novosibirsk National Research State University), Timur A. Garipov (Novosibirsk National Research State University), Anna V. Glazkova (University of Tyumen); https://www.mais-journal.ru/jour/article/view/1855; text complexity; keyword generation; morpheme parsing generation; syntax trees |
| title | Keywords, morpheme parsing and syntactic trees: features for text complexity assessment |
| topic | text complexity; keyword generation; morpheme parsing generation; syntax trees |
| url | https://www.mais-journal.ru/jour/article/view/1855 |
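
The abstract mentions two kinds of features that are simple to compute: the early surface measures (average sentence length, average number of syllables per word) and the depth and branching of syntactic trees. The following is a minimal sketch of how such quantities might be calculated, not the authors' implementation. The function names `surface_features` and `tree_features` are hypothetical, syllables are approximated by counting Russian vowel letters, and the dependency tree is supplied by hand as a list of head indices (the paper obtained its trees from Stanza; no parser is called here).

```python
import re
from typing import Dict, List

# Assumption: one syllable per vowel letter, a common approximation for Russian.
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def surface_features(text: str) -> Dict[str, float]:
    """Average words per sentence and average syllables per word."""
    # Naive sentence split on terminal punctuation (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of Unicode letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    syllables = sum(sum(ch in RUSSIAN_VOWELS for ch in w) for w in words)
    return {
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "avg_syllables_per_word": syllables / len(words) if words else 0.0,
    }


def tree_features(heads: List[int]) -> Dict[str, int]:
    """Depth and maximum branching of one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    The input is assumed to be a valid tree (no cycles).
    """
    children: Dict[int, List[int]] = {}
    for token, head in enumerate(heads, start=1):
        children.setdefault(head, []).append(token)

    def depth(node: int) -> int:
        # A leaf has depth 1; an inner node adds 1 to its deepest child.
        return 1 + max((depth(c) for c in children.get(node, [])), default=0)

    max_depth = max((depth(r) for r in children.get(0, [])), default=0)
    # Largest number of dependents attached to any real token.
    max_branching = max(
        (len(c) for h, c in children.items() if h != 0), default=0
    )
    return {"depth": max_depth, "max_branching": max_branching}


if __name__ == "__main__":
    print(surface_features("Мама мыла раму. Папа читал газету!"))
    # "Мама мыла раму": the verb is the root, both nouns depend on it.
    print(tree_features([2, 0, 2]))
```

Per-sentence values such as these would typically be aggregated over a text (for example by mean or maximum) before being placed in a feature vector; the aggregation choice is left open here.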