V2Coder: A Non-Autoregressive Vocoder Based on Hierarchical Variational Autoencoders


Bibliographic Details
Main Authors: Takato Fujimoto, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
Format: Article
Language: English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Constrained optimization; hierarchical variational autoencoders; neural vocoders; posterior collapse; prior hole problem; speech synthesis
Online Access:https://ieeexplore.ieee.org/document/11014058/
author Takato Fujimoto
Kei Hashimoto
Yoshihiko Nankaku
Keiichi Tokuda
collection DOAJ
description This paper introduces V2Coder, a non-autoregressive vocoder based on hierarchical variational autoencoders (VAEs). The hierarchical VAE, with hierarchically extended prior and approximate posterior distributions, is highly expressive for modeling the stochastic components of speech waveforms. V2Coder learns these stochastic components as hierarchical latent representations in a data-driven manner and generates diverse waveforms in the time domain. VAEs tend to suffer from a phenomenon known as posterior collapse, in which little information about the data is encoded in the latent variables. To address this problem, we introduce a carefully designed architecture and a skip loss that encourage encoding of information into the latent variables in deep layers. Additionally, VAEs suffer from low-quality samples generated from the prior distribution due to the prior hole problem. To improve sample quality, we impose a constraint on the latent information in each layer of the hierarchical VAE and demonstrate that this constrained optimization significantly affects sample quality. Experimental results using single-speaker and multi-speaker corpora showed that V2Coder is competitive with existing non-autoregressive neural vocoders based on deep generative models. V2Coder generates high-quality speech waveforms faster than real time on both GPU and CPU.
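As a rough illustration of the kind of objective the description refers to, the following is a minimal LaTeX sketch of a hierarchical VAE bound with a per-layer constraint on latent information. The top-down factorization, the number of layers L, the per-layer budgets \epsilon_l, and the use of Lagrange multipliers are illustrative assumptions and are not taken from the paper itself.

% Illustrative sketch only: a generic hierarchical (ladder-type) VAE bound, not V2Coder's exact objective.
% Prior and approximate posterior are assumed to factorize top-down over L latent layers:
%   p_\theta(z_{1:L}) = \prod_l p_\theta(z_l \mid z_{<l}),  q_\phi(z_{1:L} \mid x) = \prod_l q_\phi(z_l \mid z_{<l}, x).
\begin{align}
\log p_\theta(\mathbf{x})
  \ge \mathbb{E}_{q_\phi(\mathbf{z}_{1:L}\mid\mathbf{x})}
      \bigl[\log p_\theta(\mathbf{x}\mid\mathbf{z}_{1:L})\bigr]
    - \sum_{l=1}^{L}
      \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}_{<l}\mid\mathbf{x})}
      \Bigl[D_{\mathrm{KL}}\bigl(q_\phi(\mathbf{z}_l\mid\mathbf{z}_{<l},\mathbf{x})
      \,\big\|\, p_\theta(\mathbf{z}_l\mid\mathbf{z}_{<l})\bigr)\Bigr]}_{\mathrm{KL}_l}.
\end{align}
% Constraining the latent information carried by each layer (assumed budgets \epsilon_l)
% turns ELBO maximization into a constrained problem:
\begin{align}
\max_{\theta,\phi}\;
  \mathbb{E}_{q_\phi(\mathbf{z}_{1:L}\mid\mathbf{x})}
  \bigl[\log p_\theta(\mathbf{x}\mid\mathbf{z}_{1:L})\bigr]
  \quad\text{s.t.}\quad \mathrm{KL}_l \le \epsilon_l,\qquad l = 1,\dots,L,
\end{align}
% which is typically handled with one nonnegative Lagrange multiplier per layer.

Under this reading, posterior collapse corresponds to KL_l falling to zero for the deeper layers, and a per-layer constraint keeps each layer carrying a nonzero amount of information about the waveform; the exact formulation used by V2Coder is given in the paper itself.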
format Article
id doaj-art-0608b298199448cebc8819d76f5e14e0
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2025.3572904
volume 13
pages 92833-92847
ieee_document 11014058
author_orcid Takato Fujimoto: https://orcid.org/0009-0008-5390-3701
author_orcid Kei Hashimoto: https://orcid.org/0000-0003-2081-0396
author_orcid Yoshihiko Nankaku: https://orcid.org/0009-0000-3978-5130
author_orcid Keiichi Tokuda: https://orcid.org/0000-0001-6143-0133
affiliation Nagoya Institute of Technology, Nagoya, Japan (all authors)
title V2Coder: A Non-Autoregressive Vocoder Based on Hierarchical Variational Autoencoders
topic Constrained optimization
hierarchical variational autoencoders
neural vocoders
posterior collapse
prior hole problem
speech synthesis
url https://ieeexplore.ieee.org/document/11014058/