Ghadeer-speech-crowd-corpus: Speech dataset (Mendeley Data)

The availability of raw data is a considerable challenge across most branches of science. Without data, experiments cannot be conducted and development cannot be undertaken. Despite their importance, raw data are still lacking in many scientific fields. A literature survey conducted...

Bibliographic Details
Main Authors:	Ghadeer Qasim Ali, Husam Ali Abdulmohsin
Format:	Article
Language:	English
Published:	Elsevier 2025-02-01
Series:	Data in Brief
Subjects:	Arabic phrase; English phrase; Speech recognition; Low-resource languages
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340924011636

_version_	1832576465596579840
author	Ghadeer Qasim Ali; Husam Ali Abdulmohsin
collection	DOAJ
description	The availability of raw data is a considerable challenge across most branches of science. Without data, experiments cannot be conducted and development cannot be undertaken. Despite their importance, raw data are still lacking in many scientific fields. A literature survey conducted at the beginning of our study indicated a significant lack of Arabic speech datasets. This study therefore addresses the problem by proposing a new Arabic and English dataset called Ghadeer-Speech-Crowd-Corpus. The dataset was designed to serve more than one branch of speech processing, such as crowd speaker identification, speech synthesis (text-to-speech), and speech recognition (speech-to-text). Speech samples were recorded over three months from 210 Iraqi Arab citizens living in different parts of Iraq and cover more than one accent. The dataset is fully balanced with respect to sex and recording language (the same number of Arabic and English recordings). It is a mono dataset containing 15,626 audio samples recorded at a sampling rate of 44,100 Hz, a bit depth of 16 bits, and a bit rate of 705.6 kb/s. The recordings were made at the Academy for Media Training of the College of Media, University of Baghdad.
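The recording parameters quoted in the description are mutually consistent: for uncompressed PCM audio, bit rate = sampling rate × bit depth × channels, so 44,100 × 16 × 1 = 705,600 b/s = 705.6 kb/s. The sketch below reproduces that arithmetic and shows one possible way to check a downloaded file against the stated format. It assumes the samples are distributed as uncompressed WAV files, which the record does not state; the function names are illustrative and not part of the dataset.

```python
import wave

# Parameters as stated in the dataset description.
SAMPLE_RATE_HZ = 44_100
BIT_DEPTH = 16
CHANNELS = 1  # mono


def pcm_bit_rate_kbps(sample_rate_hz=SAMPLE_RATE_HZ,
                      bit_depth=BIT_DEPTH,
                      channels=CHANNELS):
    """Uncompressed PCM bit rate in kb/s (1 kb = 1000 bits)."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_stated_format(path):
    """Return True if a WAV file has the stated rate, depth, and channel count.

    Assumption: the file is an uncompressed PCM WAV readable by the
    standard-library wave module.
    """
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == SAMPLE_RATE_HZ
                and wav.getsampwidth() * 8 == BIT_DEPTH
                and wav.getnchannels() == CHANNELS)


if __name__ == "__main__":
    print(pcm_bit_rate_kbps())  # 705.6
```

Calling matches_stated_format on each downloaded file is a cheap sanity check before feature extraction, since a mismatched sampling rate or channel count would otherwise silently distort downstream results.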
id	doaj-art-41263c4f93e34804acdc1726848b28e7
institution	Kabale University
issn	2352-3409
spelling	doaj-art-41263c4f93e34804acdc1726848b28e7; 2025-01-31T05:11:29Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-02-01; 58; 111201; Ghadeer-speech-crowd-corpus: Speech dataset; Mendeley Data; Ghadeer Qasim Ali (0); Husam Ali Abdulmohsin (1); Computer Science Department, College of Science, University of Baghdad, Iraq; Corresponding author.; Computer Science Department, College of Science, University of Baghdad, Iraq; http://www.sciencedirect.com/science/article/pii/S2352340924011636; Arabic phrase; English phrase; Speech recognition; Low-resource languages
title	Ghadeer-speech-crowd-corpus: Speech dataset (Mendeley Data)

Similar Items