A study on phonemes recognition method for Mandarin pronunciation based on improved Zipformer-RNN-T(Pruned) modeling.
In recent years, empowered by artificial intelligence technologies, computer-assisted language learning systems have gradually become a hot topic of research. Currently, the mainstream pronunciation assessment models rely on advanced speech recognition technology, converting speech into phoneme sequ...
| Main Authors: | Zhaohui Du, Xiaofeng Zhao, Lin Li, Baohua Yu, Lijiang Miao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0324048 |
| _version_ | 1849731738656309248 |
|---|---|
| author | Zhaohui Du Xiaofeng Zhao Lin Li Baohua Yu Lijiang Miao |
| collection | DOAJ |
| description | In recent years, driven by advances in artificial intelligence, computer-assisted language learning systems have become an active research topic. Current mainstream pronunciation assessment models rely on speech recognition: they convert speech into phoneme sequences and then identify mispronounced phonemes through sequence comparison. To optimize the phoneme recognition task in pronunciation evaluation, this paper proposes a Chinese pronunciation phoneme recognition model based on an improved Zipformer-RNN-T(Pruned) architecture, aiming to improve recognition accuracy and reduce parameter count. First, the AISHELL1-PHONEME and ST-CMDS-PHONEME datasets for Mandarin phoneme recognition are constructed through data preprocessing. Then, three layers of the Zipformer Block architecture are introduced into the Zipformer encoder to enhance model performance. In the stateless Pred Network, the GELU activation function is adopted to prevent neuron deactivation. Furthermore, a hybrid Pruned RNN-T/CTC Loss fusion strategy is proposed to further optimize recognition performance. The experimental results show that the method achieves a Word Error Rate (WER) of 1.92% (Dev) and 2.12% (Test) on the AISHELL1-PHONEME dataset, and 4.28% (Dev) and 4.51% (Test) on the ST-CMDS-PHONEME dataset. Moreover, the model requires only 61.1M parameters, striking a balance between performance and efficiency. |
| id | doaj-art-352f495d96ea45e7b06131729d227aad |
| institution | DOAJ |
| issn | 1932-6203 |
| spelling | PLoS ONE, vol. 20, no. 5, e0324048 (2025-01-01); ISSN 1932-6203; Public Library of Science (PLoS); English; https://doi.org/10.1371/journal.pone.0324048 |
| title | A study on phonemes recognition method for Mandarin pronunciation based on improved Zipformer-RNN-T(Pruned) modeling. |
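
The abstract describes locating mispronounced phonemes by comparing the recognized phoneme sequence with the expected one, and it reports results as an error rate. The record does not say which comparison algorithm the authors use, so the following is only a minimal sketch that assumes a standard Levenshtein (edit-distance) alignment. The function names `align` and `error_rate` and the pinyin-style phoneme labels are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Return a minimum-edit alignment as (op, ref_phoneme, hyp_phoneme) triples.

    op is one of "ok", "sub", "del" (missing in hyp) or "ins" (extra in hyp).
    Assumption: plain Levenshtein alignment with unit costs.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append(("ok" if cost == 0 else "sub", ref[i - 1], hyp[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def error_rate(ref: List[str], hyp: List[str]) -> float:
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    errors = sum(1 for op, _, _ in align(ref, hyp) if op != "ok")
    return errors / len(ref)


if __name__ == "__main__":
    # Hypothetical labels: expected "ni3 hao3" versus a recognized wrong tone.
    expected = ["n", "i3", "h", "ao3"]
    recognized = ["n", "i2", "h", "ao3"]
    for op, r, h in align(expected, recognized):
        if op != "ok":
            print(op, r, h)
    print(error_rate(expected, recognized))
```

In this sketch every non-"ok" triple marks a candidate mispronunciation, and the same alignment yields the error rate, so a single pass serves both the learner feedback and the evaluation metric.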