An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service)
This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetry...
| Main Author: | Andreas Hallberg |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-06-01 |
| Series: | Data in Brief |
| Subjects: | Arabic; Corpus linguistics; Genre |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S235234092500188X |
| _version_ | 1849725703062290432 |
|---|---|
| author | Andreas Hallberg |
| author_facet | Andreas Hallberg |
| author_sort | Andreas Hallberg |
| collection | DOAJ |
| description | This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetry. Most of the texts are editions of works originally published in the 20th century, but the corpus also includes editions of older historical works. Books were retrieved in EPUB format and converted to plain text and HTML. Only books published under unrestricted licenses are included. Extensive metadata were collected from colophons and the publisher's website (title, author, genre, publication date, original publication date, original language, etc.). The corpus was originally collected to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of texts published under free licenses for training language models. |
| format | Article |
| id | doaj-art-a3539ea8e6d24c57b60716ec05446eba |
| institution | DOAJ |
| issn | 2352-3409 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Data in Brief |
| spelling | doaj-art-a3539ea8e6d24c57b60716ec05446eba; 2025-08-20T03:10:24Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-06-01; 60; 111456; 10.1016/j.dib.2025.111456; An 81-million-word multi-genre corpus of Arabic books; Swedish National Data Service; Andreas Hallberg; University of Gothenburg, Department of Languages and Literatures, Box 200, 40530 Gothenburg, Sweden; This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetry. Most of the texts are editions of works originally published in the 20th century, but the corpus also includes editions of older historical works. Books were retrieved in EPUB format and converted to plain text and HTML. Only books published under unrestricted licenses are included. Extensive metadata were collected from colophons and the publisher's website (title, author, genre, publication date, original publication date, original language, etc.). The corpus was originally collected to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of texts published under free licenses for training language models.; http://www.sciencedirect.com/science/article/pii/S235234092500188X; Arabic; Corpus linguistics; Genre |
| spellingShingle | Andreas Hallberg; An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service); Data in Brief; Arabic; Corpus linguistics; Genre |
| title | An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service) |
| title_full | An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service) |
| title_fullStr | An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service) |
| title_full_unstemmed | An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service) |
| title_short | An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service) |
| title_sort | 81 million word multi genre corpus of arabic books swedish national data service |
| topic | Arabic Corpus linguistics Genre |
| url | http://www.sciencedirect.com/science/article/pii/S235234092500188X |
| work_keys_str_mv | AT andreashallberg an81millionwordmultigenrecorpusofarabicbooksswedishnationaldataserivice AT andreashallberg 81millionwordmultigenrecorpusofarabicbooksswedishnationaldataserivice |