An 81-million-word multi-genre corpus of Arabic books
Swedish National Data Service

This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books span various genres, including fiction and non-fiction, children's literature, plays, and poetry.

Bibliographic Details
Main Author: Andreas Hallberg
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Data in Brief
Subjects: Arabic; Corpus linguistics; Genre
Online Access: http://www.sciencedirect.com/science/article/pii/S235234092500188X
author Andreas Hallberg
collection DOAJ
description This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books span various genres, including fiction and non-fiction, children's literature, plays, and poetry. Most of the texts are editions of works originally published in the 20th century, but the corpus also includes editions of older historical works. Books were retrieved in EPUB format and converted to plain text and HTML. Only books published under unrestricted licenses are included. Extensive metadata (title, author, genre, publication date, original publication date, original language, etc.) were collected from colophons and the publisher's website. The corpus was originally collected to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of freely licensed texts for training language models.
format Article
id doaj-art-a3539ea8e6d24c57b60716ec05446eba
institution DOAJ
issn 2352-3409
language English
publishDate 2025-06-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-a3539ea8e6d24c57b60716ec05446eba; 2025-08-20T03:10:24Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-06-01; 60; 111456; 10.1016/j.dib.2025.111456; An 81-million-word multi-genre corpus of Arabic books | Swedish National Data Service; Andreas Hallberg; University of Gothenburg, Department of Languages and Literatures, Box 200, 40530 Gothenburg, Sweden; http://www.sciencedirect.com/science/article/pii/S235234092500188X; Arabic; Corpus linguistics; Genre
title An 81-million-word multi-genre corpus of Arabic books | Swedish National Data Service
topic Arabic
Corpus linguistics
Genre
url http://www.sciencedirect.com/science/article/pii/S235234092500188X