A Benchmark Arabic Dataset for Arabic Question Classification using AAFAQ Framework

Abstract Arabic Natural Language Processing (NLP) is still faced with the complexity of the language’s morphology and the limited availability of quality annotated resources. In this paper, we introduce an open-domain dataset of 5,009 Modern Standard Arabic (MSA) questions labeled according to AAFAQ...

Bibliographic Details
Main Authors: Mariam Essam Abdelaziz, Mohanad A. Deif, Shabbab Ali Algamdi, Rania Elgohary
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05688-0
collection DOAJ
description Abstract Arabic Natural Language Processing (NLP) still faces two obstacles: the complexity of the language’s morphology and the limited availability of high-quality annotated resources. In this paper, we introduce an open-domain dataset of 5,009 Modern Standard Arabic (MSA) questions labeled according to the AAFAQ Framework (Arabic Analytical Framework for Advanced Questions), which covers 11 linguistic and cognitive aspects, e.g., Question Particle, Question Particle Type, Intent, Answer Type, Cognitive Level, and Temporal Context. The dataset is designed to support semantic and cognitive understanding for Arabic Question Classification and related tasks. Its effectiveness was validated by fine-tuning state-of-the-art models: AraBERT achieved 100% accuracy on Question Particle Type classification and 94.95% on Intent classification. Integrating the dataset into a generative question-answering system with Alpaca + Gemma-9B Unsloth improved evaluation metrics, including BLEU (+37.6%), ROUGE-1 (+132%), and BERTScore (+17.3%), which supports the dataset’s value in both classification and generation tasks. Despite its broad coverage, the dataset includes underrepresented categories, e.g., Sociology and Volunteering, which are to be addressed in future extensions. AAFAQ is a foundational benchmark for advancing Arabic question comprehension, with prospective applications in education, cognitive computing, and multilingual AI system development.
id doaj-art-97c4bca934e44c4daec232eb7a1a864a
institution Kabale University
issn 2052-4463
spelling doaj-art-97c4bca934e44c4daec232eb7a1a864a; 2025-08-24T11:07:35Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2025-08-01; vol. 12, no. 1, pp. 1-11; 10.1038/s41597-025-05688-0
Mariam Essam Abdelaziz: Department of Computer Science, College of Information Technology, Misr University for Science and Technology (MUST)
Mohanad A. Deif: Department of Computer Science, College of Information Technology, Misr University for Science and Technology (MUST)
Shabbab Ali Algamdi: Department of Software Engineering, College of Computer Science and Engineering, Prince Sattam bin Abdulaziz University
Rania Elgohary: Faculty of Computer and Information Sciences, Ain Shams University