The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets
This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monit...
| Main Author: | Fernando Henrique Calderón Alvarado |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Applied Sciences |
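| Subjects: | synthetic datasets; large language models; emotion detection; low-resource languages; regional variations |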
| Online Access: | https://www.mdpi.com/2076-3417/15/7/3490 |
| _version_ | 1849730993507794944 |
|---|---|
| author | Fernando Henrique Calderón Alvarado |
| author_facet | Fernando Henrique Calderón Alvarado |
| author_sort | Fernando Henrique Calderón Alvarado |
| collection | DOAJ |
| description | This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses, including frequency distribution, term frequency-inverse document frequency (TF-IDF), type–token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction, revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two models using bidirectional encoder representations from transformers (BERT) and its denoising sequence-to-sequence variant (BART), beginning with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications. |
| format | Article |
| id | doaj-art-71622cd30c57425995e2cc0d285e1fe7 |
| institution | DOAJ |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-71622cd30c57425995e2cc0d285e1fe7; 2025-08-20T03:08:43Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-03-01; 15; 7; 3490; 10.3390/app15073490; The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets; Fernando Henrique Calderón Alvarado; 0; Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 24205, Taiwan; This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses, including frequency distribution, term frequency-inverse document frequency (TF-IDF), type–token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction, revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two models using bidirectional encoder representations from transformers (BERT) and its denoising sequence-to-sequence variant (BART), beginning with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications.; https://www.mdpi.com/2076-3417/15/7/3490; synthetic datasets; large language models; emotion detection; low-resource languages; regional variations |
| spellingShingle | Fernando Henrique Calderón Alvarado The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets Applied Sciences synthetic datasets large language models emotion detection low-resource languages regional variations |
| title | The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets |
| title_full | The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets |
| title_fullStr | The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets |
| title_full_unstemmed | The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets |
| title_short | The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets |
| title_sort | impact of linguistic variations on emotion detection a study of regionally specific synthetic datasets |
| topic | synthetic datasets; large language models; emotion detection; low-resource languages; regional variations |
| url | https://www.mdpi.com/2076-3417/15/7/3490 |
| work_keys_str_mv | AT fernandohenriquecalderonalvarado theimpactoflinguisticvariationsonemotiondetectionastudyofregionallyspecificsyntheticdatasets AT fernandohenriquecalderonalvarado impactoflinguisticvariationsonemotiondetectionastudyofregionallyspecificsyntheticdatasets |
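
The description above names type–token ratio (TTR) and hapax legomena among the statistical analyses applied to the generated datasets. As a rough illustration only, and not the author's code, the following Python sketch computes both measures for a small list of texts. The function name `lexical_diversity`, the simple regex tokenizer, and the two sample sentences are assumptions made for this example; the article may tokenize and normalize differently.

```python
import re
from collections import Counter


def lexical_diversity(texts):
    """Return (type-token ratio, hapax legomena count) for a list of strings.

    Hypothetical helper: tokenization here is a lowercase regex split,
    which is an assumption and not taken from the article.
    """
    # Flatten all texts into one list of lowercase word tokens.
    tokens = [
        tok
        for text in texts
        for tok in re.findall(r"[a-z']+", text.lower())
    ]
    counts = Counter(tokens)
    # TTR: number of distinct word types divided by total tokens.
    ttr = len(counts) / len(tokens) if tokens else 0.0
    # Hapax legomena: word types that occur exactly once.
    hapax = sum(1 for c in counts.values() if c == 1)
    return ttr, hapax


# Invented sample sentences, one neutral and one with a UK-flavoured idiom.
samples = ["I am so happy today", "I am chuffed to bits today"]
ttr, hapax = lexical_diversity(samples)
print(f"TTR = {ttr:.3f}, hapax legomena = {hapax}")
```

A higher TTR or hapax count on one regional subset than on another would suggest a more varied vocabulary in that subset, though both measures are sensitive to corpus size and should be compared on samples of similar length.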