Predictions of Multilevel Linguistic Features to Readability of Hong Kong Primary School Textbooks: A Machine Learning Based Exploration

Bibliographic Details
Main Authors: Zhengye Xu, Yixun Li, Duo Liu
Format: Article
Language: English
Published: National Research University Higher School of Economics 2024-12-01
Series: Journal of Language and Education
Subjects: Chinese; linguistic features; Random Forest; readability models; Support Vector Machine
Online Access: https://jle.hse.ru/article/view/22221
author Zhengye Xu
Yixun Li
Duo Liu
collection DOAJ
description Introduction: Readability formulas are crucial for identifying suitable texts for children's reading development. Traditional formulas, however, are linear models designed for alphabetic languages, and they struggle with large numbers of predictors.
Purpose: To develop advanced readability formulas for Chinese texts using machine-learning algorithms that can handle hundreds of predictors. The result is also the first readability formula developed in Hong Kong.
Method: The corpus comprised 723 texts from 72 Chinese language arts textbooks used in public primary schools. The study considered 274 linguistic features at the character, word, syntax, and discourse levels as predictor variables. The outcome variables were the publisher-assigned semester scale and the teacher-rated readability level. Models were trained on fifteen combinations of linguistic features using the Support Vector Machine (SVM) and Random Forest (RF) algorithms. Model performance was evaluated by prediction accuracy and by the mean absolute error between predicted and actual readability. The top 10 predictive features of the two optimal models were then analyzed.
Results: For both the publisher-assigned and the teacher-rated (subjective) readability measures, the RF model trained on features from all levels (all-RF) and the RF model trained on character-level features only (character-RF) performed the best. The feature importance analyses of these two optimal models highlight the significance of character learning sequences, character frequency, and word frequency in estimating text readability in the Chinese context of Hong Kong. In addition, the findings suggest that publishers might rely on diverse information sources to assign semesters, whereas teachers likely prefer indices that can be derived directly from the texts themselves to gauge readability levels.
Conclusion: The findings highlight the importance of character-level features, particularly the timing of a character's introduction in the textbook, in predicting text readability in the Hong Kong Chinese context.
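
The evaluation setup described in the abstract can be sketched in a few lines of Python. This is an illustration only, not the authors' code. It rests on two assumptions that the record does not confirm: that the fifteen feature combinations are the non-empty subsets of the four feature levels (2^4 - 1 = 15), and that readability levels are coded as integers. The function names and the sample levels are hypothetical.

```python
from itertools import combinations

# Feature levels named in the abstract.
LEVELS = ["character", "word", "syntax", "discourse"]


def feature_level_combinations(levels):
    """Return every non-empty subset of the feature levels, smallest first.

    Assumption: the study's fifteen combinations are these subsets.
    """
    return [
        combo
        for size in range(1, len(levels) + 1)
        for combo in combinations(levels, size)
    ]


def _check_lengths(actual, predicted):
    """Reject empty or mismatched inputs before computing a metric."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")


def accuracy(actual, predicted):
    """Share of texts whose predicted level equals the actual level."""
    _check_lengths(actual, predicted)
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average distance, in levels, between predicted and actual readability."""
    _check_lengths(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


if __name__ == "__main__":
    combos = feature_level_combinations(LEVELS)
    print(len(combos))  # 15

    # Made-up levels for four texts, for illustration only (not study data).
    actual = [1, 2, 3, 4]
    predicted = [1, 2, 4, 2]
    print(accuracy(actual, predicted))             # 0.5
    print(mean_absolute_error(actual, predicted))  # 0.75
```

Reporting both metrics is useful because accuracy counts only exact matches, while the mean absolute error also shows how far off the misses are: a prediction one level away is penalized less than one two levels away.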
format Article
id doaj-art-b91837b66d02460caabc6906ea0b29bc
institution Kabale University
issn 2411-7390
language English
publishDate 2024-12-01
publisher National Research University Higher School of Economics
record_format Article
series Journal of Language and Education
spelling doaj-art-b91837b66d02460caabc6906ea0b29bc
2025-01-07T16:17:18Z
eng
National Research University Higher School of Economics
Journal of Language and Education
2411-7390
2024-12-01
10
4
10.17323/jle.2024.22221
Zhengye Xu (The Education University of Hong Kong, Tai Po, N.T., Hong Kong, China)
Yixun Li (The Education University of Hong Kong, Tai Po, N.T., Hong Kong, China)
Duo Liu (The Education University of Hong Kong, Tai Po, N.T., Hong Kong, China)
topic Chinese
linguistic features
Random Forest
readability models
Support Vector Machine