“Diwan”: Constructing the Largest Annotated Corpus for Arabic Poetry

Bibliographic Details
Main Authors: Badriyya B. Al-Onazi, Wadee A. Nashir, Asma A. Al-Shargabi
Format: Article
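Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Annotated Arabic poetry corpus; Arabic poetry processing; prosodic analysis; natural language processing systems; artificial intelligence
Online Access: https://ieeexplore.ieee.org/document/10925379/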
collection DOAJ
description In recent years, Arabic natural language processing has achieved remarkable advancements, particularly with the advent of large language models that have enhanced the analysis of diverse Arabic texts, including literary and artistic works such as poetry. However, these models encounter specific challenges when dealing with Arabic poetry, which is characterized by complex metrical structures, varied themes, and unique linguistic intricacies. Currently available poetry corpora are often limited in scope and depth, and typically lack the comprehensive coverage required for sophisticated computational analysis tasks. To address this gap, this paper introduces “Diwan,” the largest and most precise annotated Arabic poetry dataset/corpus, designed to support scientific research and facilitate the development of AI applications in this domain. “Diwan” comprises approximately 14 million verses of poetry across 16 major categories and includes detailed annotations related to poetic genres, prosodic structures, thematic content, linguistic patterns, and poet-specific metadata. The dataset/corpus has been meticulously curated using advanced data collection methods, rigorous normalization and annotation protocols, and expert oversight from specialists in Arabic prosody and poetry. Built with the help of intelligent annotation algorithms, “Diwan” serves as a foundational resource and benchmark dataset for advancing research in fields such as automatic poetry generation, metrical analysis, thematic classification, and plagiarism detection. “Diwan” has been compared against four leading corpora for Arabic poetry; the comparison indicates that “Diwan” exceeds all of them in scope and depth. In addition, “Diwan” is structured in a way that enables AI-powered and deep-learning-based analysis to work accurately. By providing an unprecedented foundation for computational exploration of Arabic poetry, “Diwan” opens up new avenues and possibilities for intelligent applications and promotes the digital analysis of this rich literary heritage.
id doaj-art-98f2eb4f9203424cb4eb407d16b882f6
institution OA Journals
issn 2169-3536
timestamp 2025-08-20T02:08:57Z
volume 13
pages 58927-58941
doi 10.1109/ACCESS.2025.3551161
orcid https://orcid.org/0000-0002-6572-2923 (listed after the author names in the record)
affiliation Badriyya B. Al-Onazi: Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia
Wadee A. Nashir: Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Sana’a, Yemen
Asma A. Al-Shargabi: Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia
topic Annotated Arabic poetry corpus
Arabic poetry processing
prosodic analysis
natural language processing systems
artificial intelligence