V2Coder: A Non-Autoregressive Vocoder Based on Hierarchical Variational Autoencoders
This paper introduces V2Coder, a non-autoregressive vocoder based on hierarchical variational autoencoders (VAEs). The hierarchical VAE, with hierarchically extended prior and approximate posterior distributions, is highly expressive for modeling the stochastic components of speech waveforms. V2Coder learns these stochastic components as hierarchical latent representations in a data-driven manner and generates diverse waveforms in the time domain. VAEs tend to suffer from a phenomenon known as posterior collapse, in which little data information is encoded in the latent variables. To address this problem, we introduce a carefully designed architecture and a skip loss that encourage information to be encoded in the latent variables of deep layers. Additionally, VAEs suffer from low-quality samples generated from the prior distribution, owing to the prior hole problem. To improve sample quality, we impose a constraint on the latent information in each layer of the hierarchical VAE and demonstrate that this constrained optimization significantly affects sample quality. Experimental results on single-speaker and multi-speaker corpora showed that V2Coder is competitive with existing non-autoregressive neural vocoders based on deep generative models. V2Coder generates high-quality speech waveforms faster than real time on both GPU and CPU.
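The abstract describes a hierarchical VAE whose per-layer latent information is explicitly constrained. As a point of reference only, the sketch below shows a generic hierarchical-VAE evidence lower bound with per-layer KL terms and a hypothetical per-layer information budget; the latent factorization, the layer count L, and the budgets ε_l are assumptions for exposition, not values or the exact objective taken from the paper.

```latex
% Illustrative only: a generic hierarchical-VAE objective with a
% per-layer constraint on latent information. The factorization over
% layers z_1..z_L (with z_{<l} = (z_1,...,z_{l-1})) and the budgets
% \epsilon_l are assumptions, not reproduced from the paper.
\begin{align*}
  \mathcal{L}(x)
    &= \mathbb{E}_{q(z_{1:L} \mid x)}\!\left[ \log p\!\left( x \mid z_{1:L} \right) \right]
     - \sum_{l=1}^{L} \mathbb{E}_{q(z_{<l} \mid x)}\!\left[
         D_{\mathrm{KL}}\!\left( q(z_l \mid z_{<l}, x) \,\middle\|\, p(z_l \mid z_{<l}) \right)
       \right],
\end{align*}
% maximized subject to hypothetical per-layer budgets that cap how much
% information each layer's posterior may carry beyond its prior:
\begin{align*}
  D_{\mathrm{KL}}\!\left( q(z_l \mid z_{<l}, x) \,\middle\|\, p(z_l \mid z_{<l}) \right)
    \le \epsilon_l, \qquad l = 1, \dots, L.
\end{align*}
```

Constraining each layer's KL in this way is one common reading of the abstract's "constraint on latent information in each layer": it keeps deep layers from collapsing to the prior while preventing any layer from drifting into regions the prior assigns little mass (the prior hole problem).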
| Main Authors: | Takato Fujimoto, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Constrained optimization; hierarchical variational autoencoders; neural vocoders; posterior collapse; prior hole problem; speech synthesis |
| Online Access: | https://ieeexplore.ieee.org/document/11014058/ |
| Field | Value |
|---|---|
| author | Takato Fujimoto; Kei Hashimoto; Yoshihiko Nankaku; Keiichi Tokuda |
| author_sort | Takato Fujimoto |
| collection | DOAJ |
| description | This paper introduces V2Coder, a non-autoregressive vocoder based on hierarchical variational autoencoders (VAEs). The hierarchical VAE, with hierarchically extended prior and approximate posterior distributions, is highly expressive for modeling the stochastic components of speech waveforms. V2Coder learns these stochastic components as hierarchical latent representations in a data-driven manner and generates diverse waveforms in the time domain. VAEs tend to suffer from a phenomenon known as posterior collapse, in which little data information is encoded in the latent variables. To address this problem, we introduce a carefully designed architecture and a skip loss that encourage information to be encoded in the latent variables of deep layers. Additionally, VAEs suffer from low-quality samples generated from the prior distribution, owing to the prior hole problem. To improve sample quality, we impose a constraint on the latent information in each layer of the hierarchical VAE and demonstrate that this constrained optimization significantly affects sample quality. Experimental results on single-speaker and multi-speaker corpora showed that V2Coder is competitive with existing non-autoregressive neural vocoders based on deep generative models. V2Coder generates high-quality speech waveforms faster than real time on both GPU and CPU. |
| format | Article |
| id | doaj-art-0608b298199448cebc8819d76f5e14e0 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| doi | 10.1109/ACCESS.2025.3572904 |
| citation | IEEE Access, vol. 13, pp. 92833-92847, 2025 |
| authors | Takato Fujimoto (ORCID: 0009-0008-5390-3701); Kei Hashimoto (ORCID: 0000-0003-2081-0396); Yoshihiko Nankaku (ORCID: 0009-0000-3978-5130); Keiichi Tokuda (ORCID: 0000-0001-6143-0133) |
| affiliation | Nagoya Institute of Technology, Nagoya, Japan (all authors) |
| last_indexed | 2025-08-20T02:04:00Z |
| title | V2Coder: A Non-Autoregressive Vocoder Based on Hierarchical Variational Autoencoders |
| topic | Constrained optimization; hierarchical variational autoencoders; neural vocoders; posterior collapse; prior hole problem; speech synthesis |
| url | https://ieeexplore.ieee.org/document/11014058/ |