Research on Speech Enhancement Translation and Mel-Spectrogram Mapping Method for the Deaf Based on Pix2PixGANs

Bibliographic Details
Main Authors: Shaoting Zeng, Xinran Xu, Xinyu Chi, Yuqing Liu, Huiting Yu, Feng Zou
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11002503/
Description
Summary: This study proposes an innovative speech translation method based on Pix2PixGAN that maps Mel spectrograms of speech produced by deaf individuals to those of normal-hearing individuals and generates semantically coherent speech output. The objective is to translate deaf speech into speech as it would be produced by hearing individuals, thereby improving intelligibility and supporting assisted communication. A paired Mel-spectrogram dataset was constructed from the speech of both deaf and normal-hearing individuals: deaf speech was manually extracted from video segments, the corresponding normal-hearing speech was synthesized with a text-to-speech (TTS) system, and Mel spectrograms were then extracted from both as training data. The model is built on the Pix2PixGAN framework, taking deaf-speech spectrograms as input and the corresponding hearing spectrograms as the target output. Performance was evaluated using SSIM, PSNR, and MSE. The results show high structural fidelity and clear signal restoration, particularly in the low-frequency regions associated with semantic content. Unlike traditional deaf-speech translation methods, this study combines Pix2PixGAN with Mel-spectrogram representations, reframing speech translation as an image-to-image translation problem. By matching and concatenating speech segments from a reference database, the system generates natural and intelligible speech output. User surveys gave high ratings for both the semantic consistency and the naturalness of the generated speech. This method offers a viable technical pathway for communication between deaf and hearing individuals and a useful reference for personalized speech enhancement in complex auditory environments.
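
The record itself contains no code, but the pipeline it describes is straightforward to sketch. The following minimal illustration (not the authors' implementation) shows how paired log-scaled Mel spectrograms might be extracted with librosa; the sampling rate, Mel-band count, FFT parameters, and file names are assumptions for illustration.

```python
# Minimal sketch of paired Mel-spectrogram extraction.
# Parameters and file names are assumed, not taken from the paper.
import numpy as np
import librosa

SR = 16000      # assumed sampling rate
N_MELS = 80     # assumed number of Mel bands
N_FFT = 1024    # assumed FFT window size
HOP = 256       # assumed hop length

def mel_spectrogram(path: str) -> np.ndarray:
    """Load a mono waveform and return its log-scaled Mel spectrogram."""
    wav, _ = librosa.load(path, sr=SR, mono=True)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    return librosa.power_to_db(mel, ref=np.max)

# One training pair: deaf speech extracted from a video segment, plus the
# matching normal-hearing reference synthesized by a TTS system.
source_mel = mel_spectrogram("deaf_utterance_001.wav")  # generator input
target_mel = mel_spectrogram("tts_reference_001.wav")   # generator target
```

Treating these spectrograms as single-channel images is what permits the image-to-image framing. The standard Pix2Pix objective pairs an adversarial term with an L1 reconstruction term; the PyTorch sketch below shows that textbook formulation (Isola et al., 2017), which the study's exact loss may or may not follow.

```python
# Schematic Pix2Pix losses in PyTorch (standard formulation, assumed here).
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial criterion on raw logits
l1 = nn.L1Loss()              # reconstruction criterion
LAMBDA_L1 = 100.0             # weight used in the original Pix2Pix paper

def generator_loss(fake_logits, fake_mel, real_mel):
    """Fool the discriminator while staying close to the target spectrogram."""
    adv = bce(fake_logits, torch.ones_like(fake_logits))
    return adv + LAMBDA_L1 * l1(fake_mel, real_mel)

def discriminator_loss(real_logits, fake_logits):
    """Score (input, target) pairs as real, (input, generated) pairs as fake."""
    real = bce(real_logits, torch.ones_like(real_logits))
    fake = bce(fake_logits, torch.zeros_like(fake_logits))
    return 0.5 * (real + fake)
```

Finally, the reported SSIM, PSNR, and MSE scores can all be computed with scikit-image, assuming the generated and reference spectrograms are aligned 2-D arrays normalized to [0, 1] (the normalization range is an assumption, not stated in the record):

```python
# Minimal sketch of the reported SSIM/PSNR/MSE evaluation.
# Assumes spectrograms normalized to [0, 1]; not the authors' code.
import numpy as np
from skimage.metrics import (
    structural_similarity,
    peak_signal_noise_ratio,
    mean_squared_error,
)

def evaluate(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Compare a generated Mel spectrogram against its hearing reference."""
    return {
        "SSIM": structural_similarity(reference, generated, data_range=1.0),
        "PSNR": peak_signal_noise_ratio(reference, generated, data_range=1.0),
        "MSE": mean_squared_error(reference, generated),
    }
```

Higher SSIM and PSNR and lower MSE indicate closer agreement with the reference; per the summary, agreement was strongest in the low-frequency regions that carry most semantic content.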
ISSN: 2169-3536