A study on phonemes recognition method for Mandarin pronunciation based on improved Zipformer-RNN-T(Pruned) modeling.

Bibliographic Details
Main Authors: Zhaohui Du, Xiaofeng Zhao, Lin Li, Baohua Yu, Lijiang Miao
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0324048
collection DOAJ
description In recent years, computer-assisted language learning systems powered by artificial intelligence have become an active research topic. Mainstream pronunciation assessment models currently rely on speech recognition: they convert speech into a phoneme sequence and then identify mispronounced phonemes by comparing that sequence with the expected one. To improve the phoneme recognition step of pronunciation evaluation, this paper proposes a Chinese pronunciation phoneme recognition model based on an improved Zipformer-RNN-T(Pruned) architecture, with the aim of raising recognition accuracy while reducing the parameter count. First, the AISHELL1-PHONEME and ST-CMDS-PHONEME datasets for Mandarin phoneme recognition are constructed through data preprocessing. Then, three layers of the Zipformer Block architecture are introduced into the Zipformer encoder to enhance model performance. In the stateless Pred Network, the GELU activation function is adopted to prevent neuron deactivation. Furthermore, a hybrid Pruned RNN-T/CTC loss fusion strategy is proposed to further improve recognition performance. Experimental results show that the method achieves a Word Error Rate (WER) of 1.92% (Dev) and 2.12% (Test) on the AISHELL1-PHONEME dataset, and 4.28% (Dev) and 4.51% (Test) on the ST-CMDS-PHONEME dataset. The model requires only 61.1M parameters, balancing performance and efficiency.
format Article
id doaj-art-352f495d96ea45e7b06131729d227aad
institution DOAJ
issn 1932-6203
language English
publishDate 2025-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj-art-352f495d96ea45e7b06131729d227aad | 2025-08-20T03:08:27Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2025-01-01 | 20 | 5 | e0324048 | 10.1371/journal.pone.0324048 | A study on phonemes recognition method for Mandarin pronunciation based on improved Zipformer-RNN-T(Pruned) modeling. | Zhaohui Du; Xiaofeng Zhao; Lin Li; Baohua Yu; Lijiang Miao | https://doi.org/10.1371/journal.pone.0324048
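
Note on the sequence-comparison step: the description says that mispronounced phonemes are found by comparing the recognized phoneme sequence with the expected one, and reports results as an error rate. The record does not say how the authors implement this, so the following is only a minimal sketch, assuming a standard Levenshtein (edit-distance) alignment. The function names `align` and `error_rate` and the pinyin-style phoneme labels are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Align two phoneme sequences with Levenshtein distance.

    Returns (op, ref_phoneme, hyp_phoneme) tuples, where op is one of
    "ok", "sub", "del" (missing from hyp) or "ins" (extra in hyp).
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append(("ok" if cost == 0 else "sub", ref[i - 1], hyp[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def error_rate(ref: List[str], hyp: List[str]) -> float:
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    errors = sum(1 for op, _, _ in align(ref, hyp) if op != "ok")
    return errors / len(ref)


if __name__ == "__main__":
    # Illustrative labels only: expected vs. recognized phonemes.
    expected = ["n", "i3", "h", "ao3"]
    recognized = ["n", "i2", "h", "ao3"]
    for op, r, h in align(expected, recognized):
        if op != "ok":
            print(op, r, h)
    print(error_rate(expected, recognized))
```

Every non-"ok" entry in the alignment marks a candidate mispronunciation (here, a tone substitution), and the same counts yield the error rate used as the evaluation metric.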