Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment
Although Singing Voice Synthesis (SVS) has revolutionized audio content creation, global linguistic diversity remains challenging. Current SVS research shows scant exploration of cross-lingual generalization, as fragmented, language-specific phoneme encodings (e.g., Pinyin, ARPA) hinder unified phon...
| Main Authors: | Chen Shen, Lu Zhao, Cejin Fu, Bote Gan, Zhenlong Du |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Sensors |
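| Subjects: | voice synthesis; singing voice synthesis; audio signal analysis; artificial intelligence; phonetics; cross-lingual |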
| Online Access: | https://www.mdpi.com/1424-8220/25/13/3973 |
| _version_ | 1850115466730668032 |
|---|---|
| author | Chen Shen, Lu Zhao, Cejin Fu, Bote Gan, Zhenlong Du |
| author_sort | Chen Shen |
| collection | DOAJ |
| description | Although Singing Voice Synthesis (SVS) has revolutionized audio content creation, global linguistic diversity remains challenging. Current SVS research shows scant exploration of cross-lingual generalization, as fragmented, language-specific phoneme encodings (e.g., Pinyin, ARPA) hinder unified phonetic modeling. To address this challenge, we built a four-language dataset based on GTSinger’s speech data, using the International Phonetic Alphabet (IPA) for consistent phonetic representation and applying precise segmentation and calibration for improved quality. In particular, we propose a novel method of decomposing IPA phonemes into letters and diacritics, enabling the model to deeply learn the underlying rules of pronunciation and achieve better generalization. A dynamic IPA adaptation strategy further enables the application of learned phonetic representations to unseen languages. Based on VISinger2, we introduce Transinger, an innovative cross-lingual synthesis framework. Transinger achieves breakthroughs in phoneme representation learning by precisely modeling pronunciation, which effectively enables compositional generalization to unseen languages. It also integrates Conformer and RVQ techniques to optimize information extraction and generation, achieving outstanding cross-lingual synthesis performance. Objective and subjective experiments have confirmed that Transinger significantly outperforms state-of-the-art singing synthesis methods in terms of cross-lingual generalization. These results demonstrate that multilingual aligned representations can markedly enhance model learning efficacy and robustness, even for languages not seen during training. Moreover, the integration of a strategy that splits IPA phonemes into letters and diacritics allows the model to learn pronunciation more effectively, resulting in a qualitative improvement in generalization. |
| format | Article |
| id | doaj-art-0421086b1ead4f9da9688bb0b034ec94 |
| institution | OA Journals |
| issn | 1424-8220 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Sensors |
| spelling | doaj-art-0421086b1ead4f9da9688bb0b034ec94; 2025-08-20T02:36:34Z; eng; MDPI AG; Sensors; ISSN 1424-8220; 2025-06-01; 25(13), 3973; DOI 10.3390/s25133973; Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment; Chen Shen, Lu Zhao, Cejin Fu (College of Computer and Information Engineering (College of Artificial Intelligence), Nanjing Tech University, Nanjing 211816, China); Bote Gan (College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China); Zhenlong Du (College of Computer and Information Engineering (College of Artificial Intelligence), Nanjing Tech University, Nanjing 211816, China); https://www.mdpi.com/1424-8220/25/13/3973; voice synthesis; singing voice synthesis; audio signal analysis; artificial intelligence; phonetics; cross-lingual |
| title | Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment |
| title_sort | transinger cross lingual singing voice synthesis via ipa based phonetic alignment |
| topic | voice synthesis; singing voice synthesis; audio signal analysis; artificial intelligence; phonetics; cross-lingual |
| url | https://www.mdpi.com/1424-8220/25/13/3973 |
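
The description above mentions decomposing IPA phonemes into letters and diacritics. This record does not say how the authors implement that step, so the following is only a minimal sketch of the general idea, assuming that Unicode character categories are a reasonable proxy for "diacritic". The function name `split_ipa` and the category set are hypothetical choices made for illustration, not taken from the article.

```python
import unicodedata

# Unicode general categories treated here as "diacritics" (an assumption):
# Mn/Mc/Me = combining marks (e.g. nasalization tilde),
# Lm/Sk    = modifier letters and symbols (e.g. aspiration, length mark).
MARK_CATEGORIES = {"Mn", "Mc", "Me", "Lm", "Sk"}


def split_ipa(phoneme: str) -> list[tuple[str, str]]:
    """Split an IPA string into (base letter, attached diacritics) pairs."""
    units: list[list] = []
    # NFD separates precomposed characters into base + combining marks.
    for ch in unicodedata.normalize("NFD", phoneme):
        if unicodedata.category(ch) in MARK_CATEGORIES:
            if not units:
                # A mark with no preceding letter gets an empty base.
                units.append(["", []])
            units[-1][1].append(ch)
        else:
            units.append([ch, []])
    return [(base, "".join(marks)) for base, marks in units]


if __name__ == "__main__":
    for symbol in ["tʰ", "\u00e3", "aː", "ʃ"]:
        print(repr(symbol), split_ipa(symbol))
```

With a split like this, an aspirated `tʰ` and a plain `t` share the same base-letter token, which is one way a model could reuse what it has learned about a letter when it meets an unfamiliar letter-diacritic combination. Note that in this sketch a tie bar, as in `t͡ʃ`, is attached to the first letter as a mark; a real system might handle affricates differently.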