EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models

Bibliographic Details
Main Authors: Changyong Qi, Longwei Zheng, Yuang Wei, Haoxin Xu, Peiji Chen, Xiaoqing Gu
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/1/154
author Changyong Qi
Longwei Zheng
Yuang Wei
Haoxin Xu
Peiji Chen
Xiaoqing Gu
collection DOAJ
description Educational dialogue classification is a critical task for analyzing classroom interactions and fostering effective teaching strategies. However, the scarcity of annotated data and the high cost of manual labeling pose significant challenges, especially in low-resource educational contexts. This article introduces the EduDCM framework, an original approach that integrates distant supervision with Large Language Models (LLMs) to automate the construction of high-quality educational dialogue classification datasets. EduDCM reduces the noise typically associated with distant supervision by using LLMs for context-aware label generation and by incorporating heuristic alignment techniques. To validate the framework, we constructed the EduTalk dataset, which comprises diverse classroom dialogues labeled with pedagogical categories. Extensive experiments on EduTalk and publicly available datasets, combined with expert evaluations, confirm the superior quality of EduDCM-generated datasets. Models trained on EduDCM data achieved performance comparable to that of models trained on manually annotated datasets. Expert evaluations on a 5-point Likert scale show that EduDCM outperforms Template-Based Generation and Few-Shot GPT in annotation accuracy, category coverage, and consistency. These findings underline EduDCM’s novelty and its effectiveness in generating high-quality, scalable datasets for low-resource educational NLP tasks while reducing manual annotation effort.
format Article
id doaj-art-339733682edb4125b9170b22ed296ce4
institution Kabale University
issn 2076-3417
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-339733682edb4125b9170b22ed296ce4
2025-01-10T13:14:38Z
eng
MDPI AG
Applied Sciences
2076-3417
2024-12-01
Volume 15, Issue 1, Article 154
DOI: 10.3390/app15010154
EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models
Changyong Qi (Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China)
Longwei Zheng (School of Education, City University of Macau, Macau 999078, China)
Yuang Wei (Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China)
Haoxin Xu (Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China)
Peiji Chen (Department of Mechanical Engineering and Intelligent System, The University of Electro-Communications, Tokyo 183-8585, Japan)
Xiaoqing Gu (Department of Education Information Technology, East China Normal University, Shanghai 200062, China)
https://www.mdpi.com/2076-3417/15/1/154
educational dialogue classification
low-resource tasks
large language models
distant supervision
title EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models
topic educational dialogue classification
low-resource tasks
large language models
distant supervision
url https://www.mdpi.com/2076-3417/15/1/154