An 81-million-word multi-genre corpus of Arabic books
Swedish National Data Service

This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetr...

Bibliographic Details
Main Author:	Andreas Hallberg
Format:	Article
Language:	English
Published:	Elsevier 2025-06-01
Series:	Data in Brief
Subjects:	Arabic; Corpus linguistics; Genre
Online Access:	http://www.sciencedirect.com/science/article/pii/S235234092500188X

_version_	1849725703062290432
author	Andreas Hallberg
author_facet	Andreas Hallberg
author_sort	Andreas Hallberg
collection	DOAJ
description	This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetry. Most of the texts are editions of works originally published in the 20th century, but the corpus also includes editions of older historical works. Books were retrieved in EPUB format and converted to plain text and HTML. Only books published under unrestricted licenses are included. Extensive metadata (title, author, genre, publication date, original publication date, original language, etc.) were collected from colophons and the publisher's website. The corpus was originally collected to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of freely licensed texts for training language models.
format	Article
id	doaj-art-a3539ea8e6d24c57b60716ec05446eba
institution	DOAJ
issn	2352-3409
language	English
publishDate	2025-06-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj-art-a3539ea8e6d24c57b60716ec05446eba | 2025-08-20T03:10:24Z | eng | Elsevier | Data in Brief | 2352-3409 | 2025-06-01 | 60 | 111456 | 10.1016/j.dib.2025.111456 | An 81-million-word multi-genre corpus of Arabic books | Swedish National Data Service | Andreas Hallberg (University of Gothenburg, Department of Languages and Literatures, Box 200, 40530 Gothenburg, Sweden) | http://www.sciencedirect.com/science/article/pii/S235234092500188X | Arabic | Corpus linguistics | Genre
spellingShingle	Andreas Hallberg | An 81-million-word multi-genre corpus of Arabic books | Swedish National Data Service | Data in Brief | Arabic | Corpus linguistics | Genre
title	An 81-million-word multi-genre corpus of Arabic books (Swedish National Data Service)
title_sort	81 million word multi genre corpus of arabic books swedish national data service
topic	Arabic; Corpus linguistics; Genre
url	http://www.sciencedirect.com/science/article/pii/S235234092500188X

Similar Items