Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents
This study introduces a novel method for explainable speaking skill assessment that utilizes a unique dataset featuring video recordings of conversational interviews for high-stakes outcomes (i.e., admission to high schools and universities). Unlike traditional automated speaking assessments that prioritize accuracy at the expense of interpretability, our approach employs a new multimodal dataset that integrates acoustic and linguistic features, visual cues, turn-taking patterns, and expert-derived scores quantifying various speaking skill aspects observed during interviews with young adolescents.
| Main Authors: | Candy Olivia Mawalim, Chee Wee Leong, Guy Sivan, Hung-Hsuan Huang, Shogo Okada |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-06-01 |
| Series: | Computers and Education: Artificial Intelligence |
| Subjects: | Speaking skills; Multimodal; Interpretability; Interview |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2666920X25000268 |
| author | Candy Olivia Mawalim; Chee Wee Leong; Guy Sivan; Hung-Hsuan Huang; Shogo Okada |
|---|---|
| collection | DOAJ |
| description | This study introduces a novel method for explainable speaking skill assessment that utilizes a unique dataset featuring video recordings of conversational interviews for high-stakes outcomes (i.e., admission to high schools and universities). Unlike traditional automated speaking assessments that prioritize accuracy at the expense of interpretability, our approach employs a new multimodal dataset that integrates acoustic and linguistic features, visual cues, turn-taking patterns, and expert-derived scores quantifying various speaking skill aspects observed during interviews with young adolescents. This dataset is distinguished by its open-ended question format, which allows for varied responses from interviewees, providing a rich basis for analysis. The experimental results demonstrate that fusing interpretable features, including prosody, action units, and turn-taking, significantly enhances the accuracy of spoken English skill prediction, achieving an overall accuracy of 83% when a machine learning model based on the light gradient boosting algorithm is used. Furthermore, this research underscores the significant influence of external factors, such as interviewer behavior and the interview setting, particularly on the coherence aspect of spoken English proficiency. This focus on an innovative dataset and interpretable assessment tools offers a more nuanced understanding of speaking skills in high-stakes contexts than that offered by previous studies. |
| format | Article |
| id | doaj-art-66df0d83831e40edbe8045ccdada3baa |
| institution | OA Journals |
| issn | 2666-920X |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computers and Education: Artificial Intelligence |
| doi | 10.1016/j.caeai.2025.100386 |
| author affiliations | Candy Olivia Mawalim: Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan (corresponding author); Chee Wee Leong: Educational Testing Service, Princeton, NJ, USA; Guy Sivan: Vericant.com, Beijing, China; Hung-Hsuan Huang: The University of Fukuchiyama, Fukuchiyama, Kyoto, Japan; Shogo Okada: Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan |
| title | Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents |
| topic | Speaking skills Multimodal Interpretability Interview |
| url | http://www.sciencedirect.com/science/article/pii/S2666920X25000268 |