The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets

This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monit...

Bibliographic Details
Main Author: Fernando Henrique Calderón Alvarado
Format: Article
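Language: English
Published: MDPI AG 2025-03-01
Series: Applied Sciences
Subjects: synthetic datasets; large language models; emotion detection; low-resource languages; regional variations
Online Access: https://www.mdpi.com/2076-3417/15/7/3490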
collection DOAJ
description This study examines the role of regional linguistic variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses (frequency distribution, term frequency-inverse document frequency (TF-IDF), type-token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction) revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two models, bidirectional encoder representations from transformers (BERT) and its denoising sequence-to-sequence variant (BART), beginning with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications.
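Three of the lexical statistics named in the description (type-token ratio, hapax legomena, and pointwise mutual information) can be illustrated with a minimal Python sketch. This is not the article's code: the tokenizer, the function names, and the two sample sentences are assumptions invented for illustration, and the PMI here is computed over adjacent word pairs only, which may differ from the windowing used in the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Distinct word types divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hapax_legomena(tokens):
    """Sorted list of word types that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


def bigram_pmi(tokens):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical sample sentences, made up for this sketch (not from the study's datasets).
samples = {
    "us": "I am so happy with my new apartment, it is awesome",
    "uk": "I am so chuffed with my new flat, it is brilliant",
}

for region, text in samples.items():
    tokens = tokenize(text)
    print(region, round(type_token_ratio(tokens), 3), hapax_legomena(tokens))
```

On real data, these statistics would be computed per regional dataset and then compared. TTR is sensitive to corpus size, so comparisons are usually made on samples of equal length.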
id doaj-art-71622cd30c57425995e2cc0d285e1fe7
institution DOAJ
issn 2076-3417
spelling Fernando Henrique Calderón Alvarado, Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 24205, Taiwan. The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets. Applied Sciences (MDPI AG), vol. 15, no. 7, article 3490, 2025-03-01. ISSN 2076-3417. DOI: 10.3390/app15073490
topic synthetic datasets
large language models
emotion detection
low-resource languages
regional variations