Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment

Although Singing Voice Synthesis (SVS) has revolutionized audio content creation, global linguistic diversity remains challenging. Current SVS research shows scant exploration of cross-lingual generalization, as fragmented, language-specific phoneme encodings (e.g., Pinyin, ARPA) hinder unified phon...

Bibliographic Details
Main Authors: Chen Shen, Lu Zhao, Cejin Fu, Bote Gan, Zhenlong Du
Format: Article
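Language: English
Published: MDPI AG 2025-06-01
Series: Sensors
Subjects: voice synthesis; singing voice synthesis; audio signal analysis; artificial intelligence; phonetics; cross-lingual
Online Access: https://www.mdpi.com/1424-8220/25/13/3973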
collection DOAJ
description Although Singing Voice Synthesis (SVS) has revolutionized audio content creation, global linguistic diversity remains challenging. Current SVS research shows scant exploration of cross-lingual generalization, as fragmented, language-specific phoneme encodings (e.g., Pinyin, ARPA) hinder unified phonetic modeling. To address this challenge, we built a four-language dataset based on GTSinger’s speech data, using the International Phonetic Alphabet (IPA) for consistent phonetic representation and applying precise segmentation and calibration for improved quality. In particular, we propose a novel method of decomposing IPA phonemes into letters and diacritics, enabling the model to deeply learn the underlying rules of pronunciation and achieve better generalization. A dynamic IPA adaptation strategy further enables the application of learned phonetic representations to unseen languages. Based on VISinger2, we introduce Transinger, an innovative cross-lingual synthesis framework. Transinger achieves breakthroughs in phoneme representation learning by precisely modeling pronunciation, which effectively enables compositional generalization to unseen languages. It also integrates Conformer and RVQ techniques to optimize information extraction and generation, achieving outstanding cross-lingual synthesis performance. Objective and subjective experiments have confirmed that Transinger significantly outperforms state-of-the-art singing synthesis methods in terms of cross-lingual generalization. These results demonstrate that multilingual aligned representations can markedly enhance model learning efficacy and robustness, even for languages not seen during training. Moreover, the integration of a strategy that splits IPA phonemes into letters and diacritics allows the model to learn pronunciation more effectively, resulting in a qualitative improvement in generalization.
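The letter/diacritic decomposition mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "diacritics" can be approximated by Unicode combining marks (category Mn) and modifier letters (category Lm), and the function names `decompose`, `component_vocab`, and `is_covered` are hypothetical.

```python
import unicodedata


def decompose(phoneme: str) -> tuple[str, tuple[str, ...]]:
    """Split one IPA phoneme into base letters and diacritics.

    Assumption: a diacritic is any combining mark (Mn) or modifier
    letter (Lm); the paper may draw the line differently.
    """
    letters, diacritics = [], []
    # NFD splits precomposed characters into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", phoneme):
        if unicodedata.category(ch) in ("Mn", "Lm"):
            diacritics.append(ch)
        else:
            letters.append(ch)
    return "".join(letters), tuple(diacritics)


def component_vocab(phonemes):
    """Collect every letter and diacritic seen in a phoneme inventory."""
    vocab = set()
    for p in phonemes:
        base, marks = decompose(p)
        vocab.update(base)   # adds each base letter individually
        vocab.update(marks)
    return vocab


def is_covered(phoneme, vocab):
    """True if an unseen phoneme is built only from known components."""
    base, marks = decompose(phoneme)
    return all(sym in vocab for sym in (*base, *marks))
```

With a training inventory that contains "p" and "tʰ" but not "pʰ", `is_covered("pʰ", component_vocab(["p", "tʰ"]))` holds, because both the letter and the aspiration mark were seen separately. That is the intuition behind compositional generalization to phonemes of unseen languages.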
id doaj-art-0421086b1ead4f9da9688bb0b034ec94
institution OA Journals
issn 1424-8220
spelling doaj-art-0421086b1ead4f9da9688bb0b034ec94
2025-08-20T02:36:34Z
eng
MDPI AG
Sensors, vol. 25, no. 13, art. 3973
1424-8220
2025-06-01
10.3390/s25133973
Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment
Chen Shen (0), Lu Zhao (1), Cejin Fu (2), Bote Gan (3), Zhenlong Du (4)
(0, 1, 2, 4) College of Computer and Information Engineering (College of Artificial Intelligence), Nanjing Tech University, Nanjing 211816, China
(3) College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China
https://www.mdpi.com/1424-8220/25/13/3973
voice synthesis
singing voice synthesis
audio signal analysis
artificial intelligence
phonetics
cross-lingual
topic voice synthesis
singing voice synthesis
audio signal analysis
artificial intelligence
phonetics
cross-lingual