EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models
Educational dialogue classification is a critical task for analyzing classroom interactions and fostering effective teaching strategies. However, the scarcity of annotated data and the high cost of manual labeling pose significant challenges, especially in low-resource educational contexts. This art...
Main Authors: | Changyong Qi, Longwei Zheng, Yuang Wei, Haoxin Xu, Peiji Chen, Xiaoqing Gu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-12-01 |
Series: | Applied Sciences |
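Subjects: | educational dialogue classification; low-resource tasks; large language models; distant supervision |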
Online Access: | https://www.mdpi.com/2076-3417/15/1/154 |
_version_ | 1841549433693536256 |
---|---|
author | Changyong Qi Longwei Zheng Yuang Wei Haoxin Xu Peiji Chen Xiaoqing Gu |
author_sort | Changyong Qi |
collection | DOAJ |
description | Educational dialogue classification is a critical task for analyzing classroom interactions and fostering effective teaching strategies. However, the scarcity of annotated data and the high cost of manual labeling pose significant challenges, especially in low-resource educational contexts. This article introduces the EduDCM framework to address these challenges. EduDCM integrates distant supervision with Large Language Models (LLMs) to automate the construction of high-quality educational dialogue classification datasets. It reduces the noise typically associated with distant supervision by using LLMs for context-aware label generation and by incorporating heuristic alignment techniques. To validate the framework, the authors constructed the EduTalk dataset, which comprises diverse classroom dialogues labeled with pedagogical categories. Extensive experiments on EduTalk and publicly available datasets, combined with expert evaluations, support the quality of EduDCM-generated datasets. Models trained on EduDCM data achieved performance comparable to that of models trained on manually annotated datasets. Expert evaluations on a 5-point Likert scale show that EduDCM outperforms Template-Based Generation and Few-Shot GPT in annotation accuracy, category coverage, and consistency. These findings indicate that EduDCM is effective at generating high-quality, scalable datasets for low-resource educational NLP tasks, thereby reducing manual annotation effort. |
format | Article |
id | doaj-art-339733682edb4125b9170b22ed296ce4 |
institution | Kabale University |
issn | 2076-3417 |
language | English |
publishDate | 2024-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj-art-339733682edb4125b9170b22ed296ce4; 2025-01-10T13:14:38Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2024-12-01; volume 15, issue 1, article 154; 10.3390/app15010154; EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models; Changyong Qi (Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China); Longwei Zheng (School of Education, City University of Macau, Macau 999078, China); Yuang Wei (Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China); Haoxin Xu (Shanghai Institute of AI for Education, East China Normal University, Shanghai 200062, China); Peiji Chen (Department of Mechanical Engineering and Intelligent System, The University of Electro-Communications, Tokyo 183-8585, Japan); Xiaoqing Gu (Department of Education Information Technology, East China Normal University, Shanghai 200062, China); https://www.mdpi.com/2076-3417/15/1/154; educational dialogue classification; low-resource tasks; large language models; distant supervision |
title | EduDCM: A Novel Framework for Automatic Educational Dialogue Classification Dataset Construction via Distant Supervision and Large Language Models |
topic | educational dialogue classification low-resource tasks large language models distant supervision |
url | https://www.mdpi.com/2076-3417/15/1/154 |