A Bitrate-Scalable Variational Recurrent Mel-Spectrogram Coder for Real-Time Resynthesis-Based Speech Coding

Bibliographic Details
Main Authors: Benjamin Stahl, Simon Windtner, Alois Sontacchi
Format: Article
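Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Speech codecs; recurrent neural networks; binary codes; generative adversarial networks; vocoders
Online Access: https://ieeexplore.ieee.org/document/10720741/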
author Benjamin Stahl
Simon Windtner
Alois Sontacchi
collection DOAJ
description This paper introduces a method for real-time speech coding that combines a binary-latent-vector variational recurrent neural network for mel-spectrogram coding with a non-autoregressive convolutional vocoder for waveform reconstruction. To enable bitrate scalability, we propose a latent vector truncation and padding technique. We evaluate both fixed- and scalable-bitrate variants of the proposed method, comparing them to a baseline vector quantization-based coder. The method is also benchmarked against Opus, Lyra v2, EnCodec, and AudioDec using objective metrics and subjective ratings from a MUSHRA listening test. At 1.38 kbps, the proposed method significantly outperforms Lyra v2 at 3 kbps, and at 5.51 kbps it matches the performance of Lyra v2 at 6 kbps. Although AudioDec significantly surpasses the proposed method at 6.4 kbps on test data from the TSP speech dataset, the proposed method shows competitive or superior results on withheld speakers from the VCTK dataset. The results show that recurrent coding with binary latent vectors is a viable alternative to prevailing vector quantization-based approaches.
format Article
id doaj-art-a9b7df0310d949c38998c15c034ce01a
institution OA Journals
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-a9b7df0310d949c38998c15c034ce01a
2025-08-20T02:12:37Z
eng
IEEE
IEEE Access
2169-3536
2024-01-01
Volume 12, pages 159239-159251
DOI 10.1109/ACCESS.2024.3482359
Article number 10720741
Benjamin Stahl (https://orcid.org/0000-0001-6446-0039), Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria
Simon Windtner, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria
Alois Sontacchi (https://orcid.org/0009-0008-9205-209X), Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria
title A Bitrate-Scalable Variational Recurrent Mel-Spectrogram Coder for Real-Time Resynthesis-Based Speech Coding
topic Speech codecs
recurrent neural networks
binary codes
generative adversarial networks
vocoders
url https://ieeexplore.ieee.org/document/10720741/