A Bitrate-Scalable Variational Recurrent Mel-Spectrogram Coder for Real-Time Resynthesis-Based Speech Coding
This paper introduces a method for real-time speech coding that combines a binary-latent-vector variational recurrent neural network for mel-spectrogram coding with a non-autoregressive convolutional vocoder for waveform reconstruction. To enable bitrate scalability, we propose a latent vector trunc...
| Main Authors: | Benjamin Stahl, Simon Windtner, Alois Sontacchi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | Speech codecs; recurrent neural networks; binary codes; generative adversarial networks; vocoders |
| Online Access: | https://ieeexplore.ieee.org/document/10720741/ |
| _version_ | 1850199424202964992 |
|---|---|
| author | Benjamin Stahl; Simon Windtner; Alois Sontacchi |
| collection | DOAJ |
| description | This paper introduces a method for real-time speech coding that combines a binary-latent-vector variational recurrent neural network for mel-spectrogram coding with a non-autoregressive convolutional vocoder for waveform reconstruction. To enable bitrate scalability, we propose a latent vector truncation and padding technique. We evaluate both fixed- and scalable-bitrate variants of the proposed method, comparing them to a baseline vector quantization-based coder. The method is also benchmarked against Opus, Lyra v2, EnCodec, and AudioDec using objective metrics and subjective ratings from a MUSHRA listening test. At 1.38 kbps, the proposed method significantly outperforms Lyra v2 at 3 kbps, and at 5.51 kbps it matches the performance of Lyra v2 at 6 kbps. Although AudioDec significantly surpasses the proposed method at 6.4 kbps on test data from the TSP speech dataset, the proposed method shows competitive or superior results on withheld speakers from the VCTK dataset. The results show that recurrent coding with binary latent vectors is a viable alternative to prevailing vector quantization-based approaches. |
| format | Article |
| id | doaj-art-a9b7df0310d949c38998c15c034ce01a |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-a9b7df0310d949c38998c15c034ce01a; 2025-08-20T02:12:37Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 159239-159251; DOI 10.1109/ACCESS.2024.3482359; document 10720741; A Bitrate-Scalable Variational Recurrent Mel-Spectrogram Coder for Real-Time Resynthesis-Based Speech Coding; Benjamin Stahl (https://orcid.org/0000-0001-6446-0039), Simon Windtner, Alois Sontacchi (https://orcid.org/0009-0008-9205-209X), all at the Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria; https://ieeexplore.ieee.org/document/10720741/; Speech codecs, recurrent neural networks, binary codes, generative adversarial networks, vocoders |
| title | A Bitrate-Scalable Variational Recurrent Mel-Spectrogram Coder for Real-Time Resynthesis-Based Speech Coding |
| topic | Speech codecs; recurrent neural networks; binary codes; generative adversarial networks; vocoders |
| url | https://ieeexplore.ieee.org/document/10720741/ |
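
The abstract names a latent vector truncation and padding technique for bitrate scalability but this record does not describe it further. The following is only a minimal sketch of the general idea, not the authors' implementation: the function names, the 0/1 bit convention, and the neutral fill value are all hypothetical. The frame parameters in the usage note (22050 Hz sample rate, 256-sample hop, 16 or 64 bits per frame) are likewise assumptions, chosen only because they happen to reproduce the 1.38 kbps and 5.51 kbps figures quoted in the abstract.

```python
from typing import List


def truncate_latent(bits: List[int], n_keep: int) -> List[int]:
    """Sender side (hypothetical): transmit only the first n_keep bits of a frame's latent."""
    if not 0 <= n_keep <= len(bits):
        raise ValueError("n_keep must lie between 0 and len(bits)")
    return bits[:n_keep]


def pad_latent(bits: List[int], full_len: int, fill: float = 0.0) -> List[float]:
    """Receiver side (hypothetical): restore the full latent length.

    The missing tail is filled with a neutral value; the value 0.0 is an
    assumption, not something stated in the catalog record.
    """
    if len(bits) > full_len:
        raise ValueError("received more bits than the full latent length")
    return [float(b) for b in bits] + [fill] * (full_len - len(bits))


def bitrate_kbps(bits_per_frame: int, sample_rate_hz: int, hop_samples: int) -> float:
    """Bitrate = bits per frame x frames per second, expressed in kbps."""
    frames_per_second = sample_rate_hz / hop_samples
    return bits_per_frame * frames_per_second / 1000.0
```

Under these assumed frame parameters, `bitrate_kbps(16, 22050, 256)` rounds to 1.38 and `bitrate_kbps(64, 22050, 256)` rounds to 5.51, so truncating a 64-bit latent to its first 16 bits would move between the two operating points mentioned in the abstract.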