An 81-million-word multi-genre corpus of Arabic books
Swedish National Data Service

Bibliographic Details
Main Author: Andreas Hallberg
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Data in Brief
Online Access: http://www.sciencedirect.com/science/article/pii/S235234092500188X
Description
Summary: This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetry. Most of the texts are editions of works originally published in the 20th century, but the corpus also includes editions of older historical works. Books were retrieved in EPUB format and converted to plain text and HTML. Only books published under unrestricted licenses are included. Extensive metadata (title, author, genre, publication date, original publication date, original language, etc.) were collected from colophons and the publisher's website. The corpus was originally collected in order to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of texts published under free licenses for training language models.
ISSN:2352-3409