“Diwan”: Constructing the Largest Annotated Corpus for Arabic Poetry
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10925379/ |
| Summary: | In recent years, Arabic natural language processing has advanced considerably, particularly with the advent of large language models that have improved the analysis of diverse Arabic texts, including literary and artistic works such as poetry. However, these models face specific challenges with Arabic poetry, which is characterized by complex metrical structures, varied themes, and unique linguistic intricacies. Currently available poetry corpora are often limited in scope and depth and typically lack the comprehensive coverage required for sophisticated computational analysis. To address this gap, this paper introduces “Diwan,” the largest and most precise annotated Arabic poetry dataset/corpus, designed to support scientific research and facilitate the development of AI applications in this domain. “Diwan” comprises approximately 14 million verses of poetry across 16 major categories and includes detailed annotations covering poetic genres, prosodic structures, thematic content, linguistic patterns, and poet-specific metadata. The dataset/corpus was curated using advanced data collection methods, rigorous normalization and annotation protocols, and oversight from specialists in Arabic prosody and poetry. Built with the help of intelligent annotation algorithms, “Diwan” serves as a foundational resource and benchmark dataset for research in fields such as automatic poetry generation, metrical analysis, thematic classification, and plagiarism detection. “Diwan” has been compared against four leading corpora for Arabic poetry; the comparison shows that it exceeds all of them in scope and depth. In addition, “Diwan” is structured in a way that enables AI-powered and deep-learning-based analysis to work accurately. By providing an unprecedented foundation for the computational exploration of Arabic poetry, “Diwan” opens new possibilities for intelligent applications and promotes the digital analysis of this rich literary heritage. |
|---|---|
| ISSN: | 2169-3536 |
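
The summary mentions normalization and several annotation layers (genre, prosodic structure, theme, poet metadata) without giving a schema. The following is a minimal sketch of what one annotated verse record might look like. This record does not describe the corpus format or the paper's normalization protocol, so the class `AnnotatedVerse`, its field names, the function `normalize_verse`, and the label values are all hypothetical. Stripping diacritics and the tatweel character is a common Arabic preprocessing step and is assumed here for illustration only.

```python
import unicodedata
from dataclasses import dataclass, field

TATWEEL = "\u0640"  # Arabic elongation character (kashida)


def normalize_verse(text: str) -> str:
    """Assumed normalization: drop diacritics and tatweel, collapse whitespace."""
    kept = "".join(
        ch
        for ch in text
        # Arabic short-vowel marks (harakat) have Unicode category "Mn".
        if ch != TATWEEL and unicodedata.category(ch) != "Mn"
    )
    return " ".join(kept.split())


@dataclass
class AnnotatedVerse:
    """Hypothetical record layout; not the published Diwan schema."""

    text: str   # verse as collected, possibly with diacritics
    poet: str   # poet-specific metadata, reduced here to a name
    genre: str  # poetic genre label
    meter: str  # prosodic structure label
    theme: str  # thematic label
    normalized: str = field(init=False)

    def __post_init__(self) -> None:
        # Derive the normalized form once, when the record is created.
        self.normalized = normalize_verse(self.text)


if __name__ == "__main__":
    # Illustrative values only; these labels are not taken from the corpus.
    verse = AnnotatedVerse(
        text="قِفَا نَبْكِ مِنْ ذِكْرَى حَبِيبٍ وَمَنْزِلِ",
        poet="Imru' al-Qais",
        genre="qasida",
        meter="tawil",
        theme="nasib",
    )
    print(verse.normalized)
```

Keeping the original `text` alongside the derived `normalized` field lets metrical analysis use the diacritics while search and classification work on the simplified form.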