Predictions of Multilevel Linguistic Features to Readability of Hong Kong Primary School Textbooks: A Machine Learning Based Exploration

Bibliographic Details
Main Authors:	Zhengye Xu, Yixun Li, Duo Liu
Format:	Article
Language:	English
Published:	National Research University Higher School of Economics, 2024-12-01
Series:	Journal of Language and Education
Subjects:	Chinese; linguistic features; Random Forest; readability models; Support Vector Machine
Online Access:	https://jle.hse.ru/article/view/22221

collection	DOAJ
description	Introduction: Readability formulas are crucial for identifying suitable texts for children's reading development. Traditional formulas, however, are linear models designed for alphabetic languages and struggle with numerous predictors.
Purpose: To develop advanced readability formulas for Chinese texts using machine-learning algorithms that can handle hundreds of predictors. It is also the first readability formula developed in Hong Kong.
Method: The corpus comprised 723 texts from 72 Chinese language arts textbooks used in public primary schools. The study considered 274 linguistic features at the character, word, syntax, and discourse levels as predictor variables. The outcome variables were the publisher-assigned semester scale and the teacher-rated readability level. Fifteen combinations of linguistic features were trained using Support Vector Machine (SVM) and Random Forest (RF) algorithms. Model performance was evaluated by prediction accuracy and the mean absolute error between predicted and actual readability. For both publisher-assigned and teacher-rated readability, the all-level-feature-RF and character-level-feature-RF models performed the best. The top 10 predictive features of the two optimal models were analyzed.
Results: Among the publisher-assigned and subjective readability measures, the all-RF and character-RF models performed the best. The feature importance analyses of these two optimal models highlight the significance of character learning sequences, character frequency, and word frequency in estimating text readability in the Chinese context of Hong Kong. In addition, the findings suggest that publishers might rely on diverse information sources to assign semesters, whereas teachers likely prefer to utilize indices that can be directly derived from the texts themselves to gauge readability levels.
Conclusion: The findings highlight the importance of character-level features, particularly the timing of a character's introduction in the textbook, in predicting text readability in the Hong Kong Chinese context.
id	doaj-art-b91837b66d02460caabc6906ea0b29bc
institution	DOAJ
issn	2411-7390
spelling	doaj-art-b91837b66d02460caabc6906ea0b29bc; 2025-08-20T02:44:09Z; eng; National Research University Higher School of Economics; Journal of Language and Education; ISSN 2411-7390; 2024-12-01; 10(4); doi:10.17323/jle.2024.22221; Zhengye Xu, Yixun Li, Duo Liu (all: The Education University of Hong Kong, Tai Po, N.T., Hong Kong, China)
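
The evaluation setup described in the abstract can be illustrated with a small sketch. It rests on two assumptions that the record does not state: that the "fifteen combinations" are the non-empty subsets of the four feature levels (2^4 - 1 = 15), and that accuracy and mean absolute error are computed in the usual way over integer readability levels. The function names and the sample labels are hypothetical and are not taken from the study.

```python
from itertools import combinations

# The four feature levels named in the abstract.
LEVELS = ["character", "word", "syntax", "discourse"]


def feature_level_combinations(levels):
    """Return every non-empty subset of the feature levels, smallest first."""
    return [
        combo
        for size in range(1, len(levels) + 1)
        for combo in combinations(levels, size)
    ]


def accuracy(actual, predicted):
    """Share of texts whose predicted level equals the actual level."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average distance, in levels, between predicted and actual readability."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


combos = feature_level_combinations(LEVELS)
print(len(combos))  # 15

# Made-up labels for illustration only; not data from the study.
actual = [1, 2, 3, 4]
predicted = [1, 2, 4, 2]
print(accuracy(actual, predicted))             # 0.5
print(mean_absolute_error(actual, predicted))  # 0.75
```

Reporting both metrics is useful for ordinal targets such as semester levels: accuracy counts only exact matches, while mean absolute error also reflects how far off a wrong prediction is.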

Similar Items