Ghadeer-speech-crowd-corpus: Speech dataset (Mendeley Data)

Bibliographic Details
Main Authors: Ghadeer Qasim Ali, Husam Ali Abdulmohsin
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Subjects: Arabic phrase; English phrase; Speech recognition; Low-resource languages
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924011636
collection DOAJ
description Access to raw data is a considerable challenge across most branches of science: without data, experiments cannot be conducted and development cannot proceed. Despite their importance, raw data are still lacking in many scientific fields. A literature survey conducted at the beginning of this study indicated a significant lack of Arabic speech datasets. To address this gap, the study proposes a new Arabic and English dataset called Ghadeer-Speech-Crowd-Corpus. The dataset was designed to serve several speech-processing applications, such as crowd speaker identification, speech synthesis (text-to-speech), and speech recognition (speech-to-text). Speech samples were recorded over three months from 210 Iraqi Arab citizens living in different parts of Iraq and cover more than one accent. The dataset is fully balanced with respect to sex and language (the same number of Arabic and English recordings). It contains 15,626 mono audio samples recorded at a sampling rate of 44,100 Hz and a bit depth of 16 bits, giving a bit rate of 705.6 kb/s. The recordings were made at the Academy for Media Training of the College of Media, University of Baghdad.
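The audio parameters quoted in the description are mutually consistent: for uncompressed PCM audio, bit rate = sampling rate × bit depth × channels, so 44,100 Hz × 16 bits × 1 channel = 705,600 b/s = 705.6 kb/s. The following is a minimal Python sketch of that arithmetic, together with a check that a file matches the stated format. It assumes the recordings are distributed as uncompressed PCM WAV files, which the record does not confirm; the function names `pcm_bit_rate`, `matches_spec`, and `make_silent_wav` are hypothetical helpers introduced for illustration and are not part of the dataset.

```python
import io
import wave


def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bit_depth * channels


def matches_spec(wav_file, sample_rate_hz=44_100, bit_depth=16, channels=1) -> bool:
    """Return True if a WAV file (path or file-like object) has the
    sampling rate, bit depth, and channel count given in the description.
    Assumption: the corpus files are PCM WAV; the record does not say so."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == sample_rate_hz
            and wav.getsampwidth() * 8 == bit_depth  # sample width is in bytes
            and wav.getnchannels() == channels
        )


def make_silent_wav(n_frames: int = 441) -> io.BytesIO:
    """Build a tiny in-memory mono 16-bit 44.1 kHz WAV (stand-in for a real sample)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)        # 2 bytes = 16 bits
        wav.setframerate(44_100)
        wav.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    rate = pcm_bit_rate(44_100, 16, 1)
    print(f"{rate} b/s = {rate / 1000} kb/s")
    print("synthetic sample matches spec:", matches_spec(make_silent_wav()))
```

To check a downloaded recording, pass its path to `matches_spec` in place of the synthetic in-memory file.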
id doaj-art-41263c4f93e34804acdc1726848b28e7
institution Kabale University
issn 2352-3409
citation Data in Brief, vol. 58, article 111201 (2025-02-01)
affiliation Computer Science Department, College of Science, University of Baghdad, Iraq (both authors)
topic Arabic phrase
English phrase
Speech recognition
Low-resource languages