CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Bibliographic Details
Main Authors: Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Speech enhancement; model adaptation; background noise conversion; deep learning; mobile application
Online Access: https://ieeexplore.ieee.org/document/9718270/
author Yu-Wen Chen
Kuo-Hsuan Hung
You-Jin Li
Alexander Chao-Fu Kang
Ya-Hsin Lai
Kai-Chun Liu
Szu-Wei Fu
Syu-Siang Wang
Yu Tsao
collection DOAJ
description This study presents CITISEN, a deep learning-based speech signal-processing mobile application. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), which together make it a platform for deploying and evaluating SE models and for flexibly extending those models to new noise environments and users. For SE, CITISEN downloads pretrained SE models from a cloud server and uses them to reduce noise in prerecorded or instantly recorded speech provided by users. When CITISEN encounters noisy speech from unknown speakers or noise types, the MA function improves SE performance: a few audio files of the unseen speakers or noise types are recorded, uploaded to the cloud server, and used to adapt the pretrained SE model. Finally, for BNC, CITISEN removes the original background noise with an SE model and then mixes the processed speech with a new background noise. This novel BNC function can be used to evaluate SE performance under specific conditions, to conceal a speaker's actual surroundings, and for entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the SE-enhanced signals achieved improvements of approximately 6% in short-time objective intelligibility (STOI) and 33% in perceptual evaluation of speech quality (PESQ). With MA, STOI and PESQ improved further by approximately 6% and 11%, respectively. Note that the SE model and MA method are not limited to those described in this study and can be replaced by any SE model and MA method. Finally, the BNC experiments indicated that speech signals with the original and the converted backgrounds yield similar scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and can also serve as a data augmentation method when clean speech signals are unavailable.
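
The abstract above outlines two steps that are easy to picture in code: BNC (enhance the noisy input, then mix the enhanced speech with a new background) and objective evaluation with STOI and PESQ. The Python sketch below is only an illustration of those steps as described here, not the authors' implementation: `enhance` is a hypothetical stand-in for any pretrained SE model, the 5 dB mixing SNR is an assumption (the abstract does not say how the new background is scaled), and the metrics come from the open-source `pystoi` and `pesq` packages.

```python
# Illustrative sketch only; the actual CITISEN SE model, MA procedure, and
# BNC implementation are not given in this record.
# Requires: pip install numpy pystoi pesq
import numpy as np
from pystoi import stoi   # short-time objective intelligibility
from pesq import pesq     # perceptual evaluation of speech quality

FS = 16000  # 16 kHz wideband speech assumed


def enhance(noisy: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a pretrained SE model's inference step."""
    return noisy  # a real model would return denoised speech here


def convert_background(noisy: np.ndarray, new_noise: np.ndarray,
                       snr_db: float = 5.0) -> np.ndarray:
    """Background noise conversion as described in the abstract:
    enhance the noisy input, then mix the enhanced speech with a new
    background noise, here scaled to a chosen signal-to-noise ratio."""
    speech = enhance(noisy)
    noise = np.resize(new_noise, speech.shape)        # tile/trim to match length
    speech_pow = np.mean(speech ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return speech + gain * noise


def evaluate(clean: np.ndarray, processed: np.ndarray):
    """Objective metrics used in the paper: STOI and wideband PESQ."""
    return (stoi(clean, processed, FS, extended=False),
            pesq(FS, clean, processed, "wb"))
```

In practice, the placeholder `enhance` would be the SE model downloaded from the cloud server, and the adapted model produced by MA would be used in exactly the same way.
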
format Article
id doaj-art-b63b9b86615e46eb8293e39aa85c25fb
institution Kabale University
issn 2169-3536
language English
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2022.3153469
ieee_document_id 9718270
volume 10
pages 46082-46099
affiliation Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan (all nine authors)
orcid Yu-Wen Chen: https://orcid.org/0000-0002-7473-0570
orcid Alexander Chao-Fu Kang: https://orcid.org/0000-0001-7625-4910
orcid Ya-Hsin Lai: https://orcid.org/0000-0002-6286-8441
orcid Kai-Chun Liu: https://orcid.org/0000-0001-7867-4716
orcid Syu-Siang Wang: https://orcid.org/0000-0002-2652-5521
orcid Yu Tsao: https://orcid.org/0000-0001-6956-0418
title CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
topic Speech enhancement
model adaptation
background noise conversion
deep learning
mobile application
url https://ieeexplore.ieee.org/document/9718270/