“Diwan”: Constructing the Largest Annotated Corpus for Arabic Poetry
In recent years, Arabic natural language processing has achieved remarkable advancements, particularly with the advent of large language models that have enhanced the analysis of diverse Arabic texts, including literary and artistic works such as poetry. However, these models encounter specific chal...
| Main Authors: | Badriyya B. Al-Onazi, Wadee A. Nashir, Asma A. Al-Shargabi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
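| Subjects: | Annotated Arabic poetry corpus; Arabic poetry processing; prosodic analysis; natural language processing systems; artificial intelligence |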
| Online Access: | https://ieeexplore.ieee.org/document/10925379/ |
| _version_ | 1850214203065892864 |
|---|---|
| author | Badriyya B. Al-Onazi; Wadee A. Nashir; Asma A. Al-Shargabi |
| author_sort | Badriyya B. Al-Onazi |
| collection | DOAJ |
| description | In recent years, Arabic natural language processing has achieved remarkable advancements, particularly with the advent of large language models that have enhanced the analysis of diverse Arabic texts, including literary and artistic works such as poetry. However, these models encounter specific challenges when dealing with Arabic poetry, which is characterized by complex metrical structures, varied themes, and unique linguistic intricacies. Currently available poetry corpora are often limited in scope and depth, and typically lack the comprehensive coverage required for sophisticated computational analysis tasks. To address this gap, this paper introduces “Diwan,” the largest and most precise annotated Arabic poetry dataset/corpus, designed to support scientific research and facilitate the development of AI applications in this domain. “Diwan” comprises approximately 14 million verses of poetry across 16 major categories and includes detailed annotations related to poetic genres, prosodic structures, thematic content, linguistic patterns, and poet-specific metadata. This dataset/corpus has been meticulously curated using advanced data collection methods, rigorous normalization and annotation protocols, and expert oversight from specialists in Arabic prosody and poetry. By leveraging intelligent annotation algorithms, “Diwan” serves as a foundational resource and benchmark dataset for advancing research in fields such as automatic poetry generation, metrical analysis, thematic classification, and plagiarism detection. “Diwan” has been compared against 4 leading corpora for Arabic poetry; this comparison proves that “Diwan” outperforms all of them in terms of scope and depth. In addition, “Diwan” is constructed and structured in a way that enables AI-powered and deep-learning-based analysis to work accurately. By providing an unprecedented foundation for computational exploration of Arabic poetry, “Diwan” opens up new avenues and possibilities for intelligent applications and promotes the digital analysis of this rich literary heritage. |
| format | Article |
| id | doaj-art-98f2eb4f9203424cb4eb407d16b882f6 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-98f2eb4f9203424cb4eb407d16b882f6; 2025-08-20T02:08:57Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 58927; 58941; 10.1109/ACCESS.2025.3551161; 10925379; “Diwan”: Constructing the Largest Annotated Corpus for Arabic Poetry; Badriyya B. Al-Onazi (0); Wadee A. Nashir (1); Asma A. Al-Shargabi (2); https://orcid.org/0000-0002-6572-2923; Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia; Department of Computer Science, Faculty of Computing and Information Technology, University of Science and Technology, Sana’a, Yemen; Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia; https://ieeexplore.ieee.org/document/10925379/; Annotated Arabic poetry corpus; Arabic poetry processing; prosodic analysis; natural language processing systems; artificial intelligence |
| title | “Diwan”: Constructing the Largest Annotated Corpus for Arabic Poetry |
| title_sort | diwan constructing the largest annotated corpus for arabic poetry |
| topic | Annotated Arabic poetry corpus; Arabic poetry processing; prosodic analysis; natural language processing systems; artificial intelligence |
| url | https://ieeexplore.ieee.org/document/10925379/ |