A Benchmark Arabic Dataset for Arabic Question Classification using AAFAQ Framework

Bibliographic Details
Main Authors:	Mariam Essam Abdelaziz, Mohanad A. Deif, Shabbab Ali Algamdi, Rania Elgohary
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-08-01
Series:	Scientific Data
Online Access:	https://doi.org/10.1038/s41597-025-05688-0

collection	DOAJ
description	Abstract: Arabic Natural Language Processing (NLP) still faces the complexity of the language’s morphology and the limited availability of quality annotated resources. In this paper, we introduce an open-domain dataset of 5,009 Modern Standard Arabic (MSA) questions labeled according to the AAFAQ framework, which has 11 linguistic and cognitive aspects, e.g., Question Particle, Question Particle Type, Intent, Answer Type, Cognitive Level, and Temporal Context. Based on the AAFAQ Framework (Arabic Analytical Framework for Advanced Questions), the dataset is designed to support semantic and cognitive understanding for Arabic Question Classification and related tasks. The dataset’s effectiveness was validated by fine-tuning state-of-the-art models. AraBERT achieved 100% accuracy on Question Particle Type classification and 94.95% on Intent classification. Integration within a generative question-answering system with Alpaca + Gemma-9B Unsloth improved evaluation metrics, including BLEU (+37.6%), ROUGE-1 (+132%), and BERTScore (+17.3%), validating the dataset’s value in both classification and generation tasks. Despite its broad coverage, the dataset includes underrepresented categories, e.g., Sociology and Volunteering, which are to be addressed in future extensions. AAFAQ is a foundational benchmark for the advancement of Arabic question comprehension, with prospective applications in education, cognitive computing, and multilingual AI system creation.
format	Article
id	doaj-art-97c4bca934e44c4daec232eb7a1a864a
institution	Kabale University
issn	2052-4463
language	English
publishDate	2025-08-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Data
spelling	doaj-art-97c4bca934e44c4daec232eb7a1a864a | 2025-08-24T11:07:35Z | eng | Nature Portfolio | Scientific Data | 2052-4463 | 2025-08-01 | 121111 | 10.1038/s41597-025-05688-0 | A Benchmark Arabic Dataset for Arabic Question Classification using AAFAQ Framework | Authors: Mariam Essam Abdelaziz (0), Mohanad A. Deif (1), Shabbab Ali Algamdi (2), Rania Elgohary (3) | Affiliations: (0) Department of Computer Science, College of Information Technology, Misr University for Science and Technology (MUST); (1) Department of Computer Science, College of Information Technology, Misr University for Science and Technology (MUST); (2) Department of Software Engineering, College of Computer Science and Engineering, Prince Sattam bin Abdulaziz University; (3) Faculty of Computer and Information Sciences, Ain Shams University | https://doi.org/10.1038/s41597-025-05688-0
title	A Benchmark Arabic Dataset for Arabic Question Classification using AAFAQ Framework
url	https://doi.org/10.1038/s41597-025-05688-0
