CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. CITISEN performs three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), which allow it to serve as a platform for utilizing and evaluating SE models and for flexibly extending those models to new noise environments and users.
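As described above, BNC removes the original background with an SE model and then mixes the enhanced speech with a new background noise. The following is a minimal Python sketch of that remixing step, not CITISEN's actual implementation: the `enhance` callable is a hypothetical stand-in for a pretrained SE model, the file paths are placeholders, the mixing SNR is a free parameter, and the `soundfile` package is assumed for audio I/O.

```python
# Minimal sketch of the background noise conversion (BNC) idea: denoise a
# recording with a speech enhancement (SE) model, then remix the enhanced
# speech with a new background noise at a chosen SNR. The `enhance` callable
# is a hypothetical stand-in for a pretrained SE model; file paths are
# placeholders. Assumes mono 1-D waveforms.
import numpy as np
import soundfile as sf


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    # Loop (or trim) the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def convert_background(noisy_path, new_noise_path, out_path, enhance, snr_db=5.0):
    """Replace the background of `noisy_path` with the noise in `new_noise_path`."""
    noisy, sr = sf.read(noisy_path)
    new_noise, _ = sf.read(new_noise_path)
    enhanced = enhance(noisy, sr)  # assumed SE model: noisy speech -> enhanced speech
    converted = mix_at_snr(enhanced, new_noise, snr_db)
    sf.write(out_path, converted, sr)
```

Mixing at an explicit SNR keeps the converted signal comparable to the original noisy recording, which is what lets BNC double as a data-augmentation step when clean speech is unavailable.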
| Main Authors: | Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2022-01-01 |
| Series: | IEEE Access |
| Subjects: | Speech enhancement; model adaptation; background noise conversion; deep learning; mobile application |
| Online Access: | https://ieeexplore.ieee.org/document/9718270/ |
| _version_ | 1849223135104073728 |
|---|---|
| author | Yu-Wen Chen; Kuo-Hsuan Hung; You-Jin Li; Alexander Chao-Fu Kang; Ya-Hsin Lai; Kai-Chun Liu; Szu-Wei Fu; Syu-Siang Wang; Yu Tsao |
| author_sort | Yu-Wen Chen |
| collection | DOAJ |
| description | This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. CITISEN performs three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), which allow it to serve as a platform for utilizing and evaluating SE models and for flexibly extending those models to new noise environments and users. For SE, CITISEN downloads pretrained SE models from the cloud server and uses them to reduce noise components in prerecordings or instant recordings provided by users. When it encounters noisy speech signals from unknown speakers or with unknown noise types, the MA function allows CITISEN to improve SE performance: a few audio files of the unseen speakers or noise types are recorded, uploaded to the cloud server, and used to adapt the pretrained SE model. Finally, for BNC, CITISEN removes the original background noise using an SE model and then mixes the processed speech signal with a new background noise. The novel BNC function can be used to evaluate SE performance under specific conditions, to cover people’s tracks, and for entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the speech signals enhanced by SE achieved improvements of about 6% and 33%, respectively, in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). With MA, STOI and PESQ were further improved by approximately 6% and 11%, respectively. Note that the SE model and MA method are not limited to the ones described in this study and can be replaced with any SE model and MA method. Finally, the BNC experiments indicated that speech signals with the original and converted backgrounds have similar scene-identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and serve as a data-augmentation method when clean speech signals are unavailable. A minimal sketch of computing these relative metric gains follows this record. |
| format | Article |
| id | doaj-art-b63b9b86615e46eb8293e39aa85c25fb |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2022-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-b63b9b86615e46eb8293e39aa85c25fb (indexed 2025-08-25T23:00:25Z). English. IEEE, IEEE Access, ISSN 2169-3536, 2022-01-01, vol. 10, pp. 46082-46099. DOI: 10.1109/ACCESS.2022.3153469; IEEE document 9718270. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application. Authors (all with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan): Yu-Wen Chen (https://orcid.org/0000-0002-7473-0570), Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang (https://orcid.org/0000-0001-7625-4910), Ya-Hsin Lai (https://orcid.org/0000-0002-6286-8441), Kai-Chun Liu (https://orcid.org/0000-0001-7867-4716), Szu-Wei Fu, Syu-Siang Wang (https://orcid.org/0000-0002-2652-5521), Yu Tsao (https://orcid.org/0000-0001-6956-0418). Abstract as in the description field above. Online access: https://ieeexplore.ieee.org/document/9718270/. Keywords: Speech enhancement; model adaptation; background noise conversion; deep learning; mobile application |
| title | CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application |
| topic | Speech enhancement; model adaptation; background noise conversion; deep learning; mobile application |
| url | https://ieeexplore.ieee.org/document/9718270/ |
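The description above reports relative STOI and PESQ improvements for the enhanced speech. The sketch below shows one plausible way such relative gains could be computed; it assumes the third-party `pystoi` and `pesq` Python packages, 16 kHz mono waveforms, and a percentage convention of relative gain over the noisy baseline, none of which is specified by the record itself.

```python
# Sketch of computing relative STOI/PESQ gains of enhanced speech over the
# noisy baseline. Assumes the `pystoi` and `pesq` packages; `clean`, `noisy`,
# and `enhanced` are time-aligned 16 kHz mono NumPy arrays.
import numpy as np
from pystoi import stoi
from pesq import pesq


def relative_gain(before: float, after: float) -> float:
    """Relative improvement, in percent."""
    return 100.0 * (after - before) / abs(before)


def evaluate(clean: np.ndarray, noisy: np.ndarray, enhanced: np.ndarray, fs: int = 16000) -> dict:
    stoi_noisy = stoi(clean, noisy, fs)            # intelligibility of the noisy input
    stoi_enhanced = stoi(clean, enhanced, fs)      # intelligibility after enhancement
    pesq_noisy = pesq(fs, clean, noisy, "wb")      # wideband PESQ of the noisy input
    pesq_enhanced = pesq(fs, clean, enhanced, "wb")
    return {
        "STOI gain (%)": relative_gain(stoi_noisy, stoi_enhanced),
        "PESQ gain (%)": relative_gain(pesq_noisy, pesq_enhanced),
    }
```

Averaging such per-utterance gains over a test set would yield figures comparable to the roughly 6% STOI and 33% PESQ improvements mentioned in the description, under the stated assumption about the percentage convention.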