Predictions of Multilevel Linguistic Features to Readability of Hong Kong Primary School Textbooks: A Machine Learning Based Exploration
Main Authors: | Zhengye Xu, Yixun Li, Duo Liu |
---|---|
Format: | Article |
Language: | English |
Published: |
National Research University Higher School of Economics
2024-12-01
|
Series: | Journal of Language and Education |
Subjects: | |
Online Access: | https://jle.hse.ru/article/view/22221 |
_version_ | 1841556000724746240 |
---|---|
author | Zhengye Xu; Yixun Li; Duo Liu |
author_facet | Zhengye Xu; Yixun Li; Duo Liu |
author_sort | Zhengye Xu |
collection | DOAJ |
description |
Introduction: Readability formulas are crucial for identifying suitable texts for children's reading development. Traditional formulas, however, are linear models designed for alphabetic languages and struggle with numerous predictors.
Purpose: To develop advanced readability formulas for Chinese texts using machine-learning algorithms that can handle hundreds of predictors. This is also the first readability formula developed in Hong Kong.
Method: The corpus comprised 723 texts from 72 Chinese language arts textbooks used in public primary schools. The study considered 274 linguistic features at the character, word, syntax, and discourse levels as predictor variables. The outcome variables were the publisher-assigned semester scale and the teacher-rated readability level. Support Vector Machine (SVM) and Random Forest (RF) models were trained on fifteen combinations of linguistic features. Model performance was evaluated by prediction accuracy and by the mean absolute error between predicted and actual readability. For both publisher-assigned and teacher-rated readability, the all-level-feature-RF and character-level-feature-RF models performed best. The top 10 predictive features of these two models were analyzed.
Results: For both the publisher-assigned and the teacher-rated (subjective) readability measures, the all-RF and character-RF models performed best. The feature importance analyses of these two models highlight the significance of character learning sequences, character frequency, and word frequency in estimating text readability in the Chinese context of Hong Kong. In addition, the findings suggest that publishers might rely on diverse information sources to assign semesters, whereas teachers likely prefer indices that can be derived directly from the texts themselves to gauge readability levels.
Conclusion: The findings highlight the importance of character-level features, particularly the timing of a character's introduction in the textbook, in predicting text readability in the Hong Kong Chinese context.
|
format | Article |
id | doaj-art-b91837b66d02460caabc6906ea0b29bc |
institution | Kabale University |
issn | 2411-7390 |
language | English |
publishDate | 2024-12-01 |
publisher | National Research University Higher School of Economics |
record_format | Article |
series | Journal of Language and Education |
spelling | doaj-art-b91837b66d02460caabc6906ea0b29bc; 2025-01-07T16:17:18Z; eng; National Research University Higher School of Economics; Journal of Language and Education; ISSN 2411-7390; 2024-12-01; vol. 10, no. 4; DOI 10.17323/jle.2024.22221; Zhengye Xu, Yixun Li, Duo Liu (all: The Education University of Hong Kong, Tai Po, N.T., Hong Kong, China); https://jle.hse.ru/article/view/22221; keywords: Chinese; linguistic features; Random Forest; readability models; Support Vector Machine |
title | Predictions of Multilevel Linguistic Features to Readability of Hong Kong Primary School Textbooks: A Machine Learning Based Exploration |
topic | Chinese; linguistic features; Random Forest; readability models; Support Vector Machine |
url | https://jle.hse.ru/article/view/22221 |
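The Method summary in the description mentions fifteen feature combinations and two evaluation measures (prediction accuracy and mean absolute error). The record does not say how the fifteen combinations were formed; one plausible reading, assumed here, is that they are the non-empty subsets of the four feature levels (2^4 - 1 = 15). The Python sketch below illustrates that reading together with the two metrics. The function names are hypothetical, and treating readability levels as ordinal integers is also an assumption; this is not the authors' code.

```python
from itertools import combinations

# The four feature levels named in the abstract.
LEVELS = ("character", "word", "syntax", "discourse")


def level_combinations(levels=LEVELS):
    """Return every non-empty subset of the feature levels.

    Assumption: the study's "fifteen combinations" are these subsets.
    """
    return [
        combo
        for size in range(1, len(levels) + 1)
        for combo in combinations(levels, size)
    ]


def _check_lengths(actual, predicted):
    """Reject empty or mismatched label sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")


def accuracy(actual, predicted):
    """Fraction of texts whose predicted level equals the actual level."""
    _check_lengths(actual, predicted)
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute distance between predicted and actual levels.

    Assumption: levels are coded as ordinal integers (e.g. semesters).
    """
    _check_lengths(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

For example, `len(level_combinations())` gives 15, and for the made-up labels `actual = [1, 2, 3, 4]` and `predicted = [1, 2, 4, 6]`, `accuracy` returns 0.5 and `mean_absolute_error` returns 0.75. Reporting both is useful for ordinal outcomes, because accuracy counts only exact matches while the mean absolute error also reflects how far off the misses are.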