Developing Effective Techniques for the Recognition of Shanghai Dialect Text
Recognizing Shanghai dialect text is crucial for preserving local dialects, yet research on its automatic distinction from Standard Mandarin remains limited. We construct a carefully balanced dataset specifically for the task of Shanghai dialect recognition and propose a two-stage approach for autom...
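| Main Authors: | Yida Bao, Zheng Zhang, Mohammad Arifuzzaman, Tran Duc Le, Qi Li, Masuzyo Mwanza, Jiaqing Lin, Philippe Gaillard, Jiafeng Ye |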
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | BERT; support vector machine; cultural heritage preservation; Jieba; Shanghai dialect |
| Online Access: | https://ieeexplore.ieee.org/document/11053757/ |
| _version_ | 1849706738472714240 |
|---|---|
| author | Yida Bao; Zheng Zhang; Mohammad Arifuzzaman; Tran Duc Le; Qi Li; Masuzyo Mwanza; Jiaqing Lin; Philippe Gaillard; Jiafeng Ye |
| author_sort | Yida Bao |
| collection | DOAJ |
| description | Recognizing Shanghai dialect text is crucial for preserving local dialects, yet research on its automatic distinction from Standard Mandarin remains limited. We construct a carefully balanced dataset specifically for the task of Shanghai dialect recognition and propose a two-stage approach for automatic language classification. In the first stage, we employ Jieba tokenization to retain dialect-specific lexical nuances, ensuring essential semantic and syntactic distinctions are captured. Next, we independently train both a BERT-Chinese-Based classifier and a traditional Support Vector Machine classifier for dialect recognition. The BERT model leverages powerful contextual representations to capture subtle differences between Shanghai dialect and Standard Mandarin, while the Support Vector Machine serves as a conventional baseline. Extensive experiments comparing the two approaches revealed that, although the Support Vector Machine can adequately perform the classification task, the BERT-Based classifier achieves significantly higher accuracy and is more sensitive to the nuanced linguistic features of the dialect. Further analysis through attention visualization reveals how the model specifically attends to unique dialectal features, highlighting distinctive lexical and structural differences between Shanghai dialect and Mandarin text. To the best of our knowledge, this study is the first to apply NLP techniques for language classification between Shanghai dialect and Standard Mandarin, emphasizing the potential for automated dialect recognition as an effective method for dialect documentation and preservation. |
| format | Article |
| id | doaj-art-4328fa0123ab4ab9bb439d4b587b2b90 |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-4328fa0123ab4ab9bb439d4b587b2b90; 2025-08-20T03:16:07Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 111802-111813; DOI 10.1109/ACCESS.2025.3583708; document 11053757; Developing Effective Techniques for the Recognition of Shanghai Dialect Text; Yida Bao (https://orcid.org/0000-0002-7789-2735; Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA); Zheng Zhang (https://orcid.org/0000-0003-1707-624X; Department of Computer Science Information Systems, Murray State University, Murray, KY, USA); Mohammad Arifuzzaman (Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA); Tran Duc Le (https://orcid.org/0000-0003-3735-0314; Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA); Qi Li (Department of Mathematics and Computer Science, Fisk University, Nashville, TN, USA); Masuzyo Mwanza (Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA); Jiaqing Lin (Shanghai Pudong Foreign Language School, Shanghai, China); Philippe Gaillard (Department of Biostatistics, Data Science, and Epidemiology, Augusta University, Augusta, GA, USA); Jiafeng Ye (LAF-NERC, Shanghai Jiao Tong University, Shanghai, China); https://ieeexplore.ieee.org/document/11053757/; BERT; support vector machine; cultural heritage preservation; Jieba; Shanghai dialect |
| title | Developing Effective Techniques for the Recognition of Shanghai Dialect Text |
| topic | BERT; support vector machine; cultural heritage preservation; Jieba; Shanghai dialect |
| url | https://ieeexplore.ieee.org/document/11053757/ |
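
The description mentions a Support Vector Machine baseline trained on tokenized text. As a rough illustration only, the sketch below trains a small linear SVM (hinge loss, plain subgradient descent) in pure Python. Everything in it is an assumption rather than the authors' setup: the sentences are hand-made toy examples and not the paper's dataset, character unigrams and bigrams stand in for Jieba tokens, and the names `featurize`, `train_linear_svm`, and `predict` are hypothetical. With the third-party `jieba` package installed, the token list in `featurize` could presumably be replaced with `jieba.lcut(text)`.

```python
from collections import Counter

# Hand-made toy sentences (NOT from the paper's dataset).
# Label +1 = Shanghai dialect text, -1 = Standard Mandarin text.
TRAIN = [
    ("侬今朝去啥地方白相", 1),
    ("阿拉勿晓得伊辰光", 1),
    ("侬好阿拉一道去白相", 1),
    ("伊今朝勿来", 1),
    ("你今天去什么地方玩", -1),
    ("我们不知道他的时间", -1),
    ("你好我们一起去玩", -1),
    ("他今天不来", -1),
]


def featurize(text):
    """Count character unigrams and bigrams (a stand-in for Jieba tokens)."""
    grams = list(text) + [text[i:i + 2] for i in range(len(text) - 1)]
    return Counter(grams)


def score(weights, bias, feats):
    """Linear decision value w.x + b for a sparse feature vector."""
    return bias + sum(weights.get(g, 0.0) * v for g, v in feats.items())


def train_linear_svm(data, epochs=200, lr=0.02, reg=0.01):
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss."""
    weights, bias = {}, 0.0
    examples = [(featurize(text), label) for text, label in data]
    for _ in range(epochs):
        for feats, label in examples:
            margin = label * score(weights, bias, feats)
            # Shrink all weights slightly (gradient of the L2 penalty).
            for g in weights:
                weights[g] *= 1.0 - lr * reg
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for g, v in feats.items():
                    weights[g] = weights.get(g, 0.0) + lr * label * v
                bias += lr * label
    return weights, bias


def predict(model, text):
    """Return +1 (dialect) or -1 (Mandarin) for a piece of text."""
    weights, bias = model
    return 1 if score(weights, bias, featurize(text)) >= 0 else -1


if __name__ == "__main__":
    model = train_linear_svm(TRAIN)
    for sentence in ["侬今朝勿白相", "你今天不玩"]:
        print(sentence, predict(model, sentence))
```

A bag-of-n-grams linear model like this can only pick up surface lexical cues such as dialect-specific characters. That is consistent with the record's framing of the SVM as a conventional baseline next to the contextual BERT classifier.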