Developing Effective Techniques for the Recognition of Shanghai Dialect Text

Recognizing Shanghai dialect text is crucial for preserving local dialects, yet research on its automatic distinction from Standard Mandarin remains limited. We construct a carefully balanced dataset specifically for the task of Shanghai dialect recognition and propose a two-stage approach for autom...

Bibliographic Details
Main Authors:	Yida Bao, Zheng Zhang, Mohammad Arifuzzaman, Tran Duc Le, Qi Li, Masuzyo Mwanza, Jiaqing Lin, Philippe Gaillard, Jiafeng Ye
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	BERT; support vector machine; cultural heritage preservation; Jieba; Shanghai dialect
Online Access:	https://ieeexplore.ieee.org/document/11053757/

author	Yida Bao; Zheng Zhang; Mohammad Arifuzzaman; Tran Duc Le; Qi Li; Masuzyo Mwanza; Jiaqing Lin; Philippe Gaillard; Jiafeng Ye
collection	DOAJ
description	Recognizing Shanghai dialect text is crucial for preserving local dialects, yet research on its automatic distinction from Standard Mandarin remains limited. We construct a carefully balanced dataset specifically for the task of Shanghai dialect recognition and propose a two-stage approach for automatic language classification. In the first stage, we employ Jieba tokenization to retain dialect-specific lexical nuances, ensuring essential semantic and syntactic distinctions are captured. Next, we independently train both a BERT-Chinese-Based classifier and a traditional Support Vector Machine classifier for dialect recognition. The BERT model leverages powerful contextual representations to capture subtle differences between Shanghai dialect and Standard Mandarin, while the Support Vector Machine serves as a conventional baseline. Extensive experiments comparing the two approaches revealed that, although the Support Vector Machine can adequately perform the classification task, the BERT-Based classifier achieves significantly higher accuracy and is more sensitive to the nuanced linguistic features of the dialect. Further analysis through attention visualization reveals how the model specifically attends to unique dialectal features, highlighting distinctive lexical and structural differences between Shanghai dialect and Mandarin text. To the best of our knowledge, this study is the first to apply NLP techniques for language classification between Shanghai dialect and Standard Mandarin, emphasizing the potential for automated dialect recognition as an effective method for dialect documentation and preservation.
format	Article
id	doaj-art-4328fa0123ab4ab9bb439d4b587b2b90
institution	DOAJ
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-4328fa0123ab4ab9bb439d4b587b2b90; 2025-08-20T03:16:07Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; volume 13; pages 111802-111813; DOI 10.1109/ACCESS.2025.3583708; document 11053757
	Developing Effective Techniques for the Recognition of Shanghai Dialect Text
	0 Yida Bao (https://orcid.org/0000-0002-7789-2735): Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA
	1 Zheng Zhang (https://orcid.org/0000-0003-1707-624X): Department of Computer Science Information Systems, Murray State University, Murray, KY, USA
	2 Mohammad Arifuzzaman: Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA
	3 Tran Duc Le (https://orcid.org/0000-0003-3735-0314): Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, WI, USA
	4 Qi Li: Department of Mathematics and Computer Science, Fisk University, Nashville, TN, USA
	5 Masuzyo Mwanza: Department of Mathematics and Statistics, Auburn University, Auburn, AL, USA
	6 Jiaqing Lin: Shanghai Pudong Foreign Language School, Shanghai, China
	7 Philippe Gaillard: Department of Biostatistics, Data Science, and Epidemiology, Augusta University, Augusta, GA, USA
	8 Jiafeng Ye: LAF-NERC, Shanghai Jiao Tong University, Shanghai, China
title	Developing Effective Techniques for the Recognition of Shanghai Dialect Text
topic	BERT; support vector machine; cultural heritage preservation; Jieba; Shanghai dialect
url	https://ieeexplore.ieee.org/document/11053757/
