Developing Effective Techniques for the Recognition of Shanghai Dialect Text

Bibliographic Details
Main Authors: Yida Bao, Zheng Zhang, Mohammad Arifuzzaman, Tran Duc Le, Qi Li, Masuzyo Mwanza, Jiaqing Lin, Philippe Gaillard, Jiafeng Ye
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: BERT; support vector machine; cultural heritage preservation; Jieba; Shanghai dialect
Online Access: https://ieeexplore.ieee.org/document/11053757/
collection DOAJ
description Recognizing Shanghai dialect text is crucial for preserving local dialects, yet research on its automatic distinction from Standard Mandarin remains limited. We construct a carefully balanced dataset specifically for the task of Shanghai dialect recognition and propose a two-stage approach for automatic language classification. In the first stage, we employ Jieba tokenization to retain dialect-specific lexical nuances, ensuring essential semantic and syntactic distinctions are captured. Next, we independently train both a BERT-Chinese-Based classifier and a traditional Support Vector Machine classifier for dialect recognition. The BERT model leverages powerful contextual representations to capture subtle differences between Shanghai dialect and Standard Mandarin, while the Support Vector Machine serves as a conventional baseline. Extensive experiments comparing the two approaches revealed that, although the Support Vector Machine can adequately perform the classification task, the BERT-Based classifier achieves significantly higher accuracy and is more sensitive to the nuanced linguistic features of the dialect. Further analysis through attention visualization reveals how the model specifically attends to unique dialectal features, highlighting distinctive lexical and structural differences between Shanghai dialect and Mandarin text. To the best of our knowledge, this study is the first to apply NLP techniques for language classification between Shanghai dialect and Standard Mandarin, emphasizing the potential for automated dialect recognition as an effective method for dialect documentation and preservation.
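The Support Vector Machine baseline mentioned in the description can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: character unigrams and bigrams stand in for Jieba tokens, a Pegasos-style stochastic subgradient solver stands in for whatever SVM library the paper used, and the eight sentences are hand-written toy examples, not items from the paper's dataset. The function names (`char_ngrams`, `train_linear_svm`, `predict`) and the hyperparameters are likewise hypothetical.

```python
import random
from collections import Counter

# Hand-written toy sentences for illustration only; NOT from the paper's dataset.
SHANGHAI = ["侬今朝去啥地方", "阿拉一道去白相", "伊勿晓得", "侬饭吃过了伐"]
MANDARIN = ["你今天去哪里", "我们一起去玩", "他不知道", "你吃饭了吗"]


def char_ngrams(text, n=2):
    """Count character unigrams and n-grams (an assumed stand-in for Jieba tokens)."""
    grams = list(text)
    grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(grams)


def score(w, feats):
    """Dot product between a sparse weight dict and a sparse feature Counter."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())


def train_linear_svm(samples, labels, lam=0.01, epochs=200, seed=0):
    """Linear SVM trained by Pegasos-style subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w, t = {}, 0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            x, y = samples[i], labels[i]
            violated = y * score(w, x) < 1.0   # margin check uses the current weights
            for f in w:
                w[f] *= 1.0 - 1.0 / t          # L2 shrink, equal to 1 - eta * lam
            if violated:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w


def predict(w, text):
    """Return +1 for Shanghai dialect, -1 for Standard Mandarin."""
    return 1 if score(w, char_ngrams(text)) > 0 else -1


texts = SHANGHAI + MANDARIN
labels = [1] * len(SHANGHAI) + [-1] * len(MANDARIN)
weights = train_linear_svm([char_ngrams(s) for s in texts], labels)
correct = sum(predict(weights, s) == y for s, y in zip(texts, labels))
print(correct, "of", len(texts), "training sentences classified correctly")
```

With real data one would presumably replace `char_ngrams` with Jieba segmentation and evaluate on a held-out split; the toy run above only reports fit on its own training sentences and says nothing about the accuracy figures in the paper.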
id doaj-art-4328fa0123ab4ab9bb439d4b587b2b90
institution DOAJ
issn 2169-3536
spelling doaj-art-4328fa0123ab4ab9bb439d4b587b2b90 2025-08-20T03:16:07Z
Volume: 13
Pages: 111802-111813
DOI: 10.1109/ACCESS.2025.3583708
IEEE Xplore document: 11053757
Author affiliations:
Yida Bao (https://orcid.org/0000-0002-7789-2735): Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA
Zheng Zhang (https://orcid.org/0000-0003-1707-624X): Department of Computer Science Information Systems, Murray State University, Murray, KY, USA
Mohammad Arifuzzaman: Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA
Tran Duc Le (https://orcid.org/0000-0003-3735-0314): Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA
Qi Li: Department of Mathematics and Computer Science, Fisk University, Nashville, TN, USA
Masuzyo Mwanza: Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA
Jiaqing Lin: Shanghai Pudong Foreign Language School, Shanghai, China
Philippe Gaillard: Department of Biostatistics, Data Science, and Epidemiology, Augusta University, Augusta, GA, USA
Jiafeng Ye: LAF-NERC, Shanghai Jiao Tong University, Shanghai, China
topic BERT
support vector machine
cultural heritage preservation
Jieba
Shanghai dialect