Listen or Read? The Impact of Proficiency and Visual Complexity on Learners’ Reliance on Captions

Bibliographic Details
Main Author: Yan Li
Format: Article
Language: English
Published: MDPI AG, 2025-04-01
Series: Behavioral Sciences
Online Access: https://www.mdpi.com/2076-328X/15/4/542
Description
Summary: This study investigates how Chinese EFL (English as a foreign language) learners at low and high proficiency levels allocate attention between captions and audio while watching videos, and how visual complexity (single- vs. multi-speaker content) influences caption reliance. The study employed a novel paused-transcription method to assess real-time processing. A total of 64 participants (31 low-proficiency [A1–A2] and 33 high-proficiency [C1–C2] learners) viewed single- and multi-speaker videos with English captions. Misleading captions were inserted to measure reliance on captions versus audio objectively. Results revealed significant proficiency effects: low-proficiency learners prioritized captions (reading scores > listening, Z = −4.55, p < 0.001, r = 0.82), while high-proficiency learners focused on audio (listening > reading, Z = −5.12, p < 0.001, r = 0.89). Multi-speaker videos amplified caption reliance for low-proficiency learners (r = 0.75) and moderately increased it for high-proficiency learners (r = 0.52). These findings demonstrate that low-proficiency learners rely overwhelmingly on captions during video viewing, while high-proficiency learners integrate multimodal inputs. Notably, increased visual complexity amplifies caption reliance at both proficiency levels. The implications are twofold. Pedagogically, educators could design tiered caption-removal protocols that phase out captions as skills improve, supplemented by adjustable caption-opacity tools. Technologically, future research could develop dynamic captioning systems that leverage eye tracking and AI to adapt to real-time proficiency, optimizing learning experiences. Additionally, video complexity should be calibrated to learners' proficiency levels.
ISSN: 2076-328X
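
Note: the effect sizes reported in the summary are consistent with the standard effect-size formula for a Wilcoxon signed-rank test, r = |Z| / sqrt(n), applied to each group's sample size. The following minimal Python sketch reproduces the reported values; it is an illustration under that assumption, not code from the study, and the function name is hypothetical.

    import math

    # Effect size for a Wilcoxon signed-rank test: r = |Z| / sqrt(n),
    # where n is the number of participants contributing paired scores.
    def wilcoxon_effect_size(z: float, n: int) -> float:
        return abs(z) / math.sqrt(n)

    # Values as reported in the abstract.
    print(round(wilcoxon_effect_size(-4.55, 31), 2))  # low-proficiency group (n = 31)  -> 0.82
    print(round(wilcoxon_effect_size(-5.12, 33), 2))  # high-proficiency group (n = 33) -> 0.89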