A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion

Singing voice conversion methods encounter challenges in achieving a delicate balance between synthesis quality and singer similarity. Traditional voice conversion techniques primarily emphasize singer similarity, often leading to robotic-sounding singing voices. Deep learning-based singing voice co...

Bibliographic Details
Main Authors:	Assila Yousuf, David Solomon George
Format:	Article
Language:	English
Published:	AIMS Press 2024-06-01
Series:	AIMS Electronics and Electrical Engineering
Subjects:	one-shot singing voice conversion; instance normalization; adain; again; hybrid cnn-lstm model
Online Access:	https://www.aimspress.com/article/doi/10.3934/electreng.2024013

_version_	1832590266714816512
author	Assila Yousuf; David Solomon George
author_sort	Assila Yousuf
collection	DOAJ
description	Singing voice conversion methods struggle to balance synthesis quality against singer similarity. Traditional voice conversion techniques primarily emphasize singer similarity, often producing robotic-sounding singing voices. Deep learning-based singing voice conversion techniques instead focus on disentangling singer-dependent and singer-independent features. While this approach can improve the quality of synthesized singing voices, many voice conversion systems still suffer from leakage of singer-dependent features into the content embeddings. The proposed singing voice conversion technique implements an encoder-decoder framework using a hybrid model that combines a convolutional neural network (CNN) with long short-term memory (LSTM). This paper investigated the use of activation guidance and adaptive instance normalization for one-shot singing voice conversion. The instance normalization (IN) layers within the auto-encoder separated singer and content representations. During conversion, singer representations were transferred using adaptive instance normalization (AdaIN) layers. With the help of an activation function, the system prevented singer information from leaking through while still conveying the singing content. Additionally, combining LSTM with CNN allows the model to capture both local and contextual features. The one-shot capability simplified the architecture, which uses a single encoder and decoder. The objective and subjective evaluations showed that the proposed hybrid CNN-LSTM model outperformed the baseline architectures without compromising either quality or similarity. Evaluation results showed a mean opinion score (MOS) of 2.93 for naturalness and 3.35 for melodic similarity. The hybrid CNN-LSTM approach performed high-quality voice conversion with minimal training data, making it a promising solution for various applications.
format	Article
id	doaj-art-84c69169f5174828a94a2c30d3f3298c
institution	Kabale University
issn	2578-1588
language	English
publishDate	2024-06-01
publisher	AIMS Press
record_format	Article
series	AIMS Electronics and Electrical Engineering
spelling	doaj-art-84c69169f5174828a94a2c30d3f3298c | 2025-01-24T01:10:37Z | eng | AIMS Press | AIMS Electronics and Electrical Engineering | 2578-1588 | 2024-06-01 | 8 | 3 | 282 | 300 | 10.3934/electreng.2024013 | A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion | Assila Yousuf (0) | David Solomon George (1) | (0) Department of Electronics and Communication Engineering, Rajiv Gandhi Institute of Technology, Kottayam, Kerala, 686501, India (Affiliated to APJ Abdul Kalam Technological University, Kerala) | (1) Department of Electronics and Communication Engineering, Rajiv Gandhi Institute of Technology, Kottayam, Kerala, 686501, India (Affiliated to APJ Abdul Kalam Technological University, Kerala) | https://www.aimspress.com/article/doi/10.3934/electreng.2024013 | one-shot singing voice conversion | instance normalization | adain | again | hybrid cnn-lstm model
title	A hybrid CNN-LSTM model with adaptive instance normalization for one shot singing voice conversion
title_sort	hybrid cnn lstm model with adaptive instance normalization for one shot singing voice conversion
topic	one-shot singing voice conversion; instance normalization; adain; again; hybrid cnn-lstm model
url	https://www.aimspress.com/article/doi/10.3934/electreng.2024013
work_keys_str_mv	AT assilayousuf ahybridcnnlstmmodelwithadaptiveinstancenormalizationforoneshotsingingvoiceconversion AT davidsolomongeorge ahybridcnnlstmmodelwithadaptiveinstancenormalizationforoneshotsingingvoiceconversion AT assilayousuf hybridcnnlstmmodelwithadaptiveinstancenormalizationforoneshotsingingvoiceconversion AT davidsolomongeorge hybridcnnlstmmodelwithadaptiveinstancenormalizationforoneshotsingingvoiceconversion
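
Illustration (not part of the catalog record): the description says that instance normalization (IN) separates singer and content representations and that AdaIN transfers singer representations during conversion. The NumPy sketch below shows the generic IN and AdaIN operations on feature maps of an assumed shape (channels, frames). The function names, the `EPS` constant, and the random test features are hypothetical choices for this example; this is not the authors' implementation and it omits the CNN-LSTM encoder and decoder entirely.

```python
import numpy as np

EPS = 1e-5  # assumed small constant for numerical stability


def channel_stats(feat):
    """Per-channel mean and standard deviation over the time axis.

    feat: array of shape (channels, frames).
    """
    mean = feat.mean(axis=1, keepdims=True)
    std = np.sqrt(feat.var(axis=1, keepdims=True) + EPS)
    return mean, std


def instance_norm(feat):
    """Remove per-channel statistics (treated here as singer-dependent)."""
    mean, std = channel_stats(feat)
    return (feat - mean) / std


def adain(content_feat, singer_feat):
    """Re-scale normalized content features with the target singer's statistics."""
    mean, std = channel_stats(singer_feat)
    return instance_norm(content_feat) * std + mean


# Hypothetical feature maps standing in for encoder outputs.
rng = np.random.default_rng(0)
source = rng.normal(loc=2.0, scale=3.0, size=(4, 50))   # source singer
target = rng.normal(loc=-1.0, scale=0.5, size=(4, 80))  # target singer
converted = adain(source, target)
```

After the call, `converted` keeps the time layout of `source` but carries the per-channel mean and standard deviation of `target`, which is the sense in which AdaIN "transfers" a style. Because only statistics of the target are needed, a single reference utterance is enough, which matches the one-shot setting described in the record.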

Similar Items