Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models
In recent years, the advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including tasks related to speech processing. Based on these new trends, new models such as Whisper and Wa...
| Main Authors: | Francisco Javier Lima Florido, Gloria Corpas Pastor |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-03-01 |
| Series: | Computers |
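| Subjects: | speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2 |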
| Online Access: | https://www.mdpi.com/2073-431X/14/3/102 |
| _version_ | 1850089899979440128 |
|---|---|
| author | Francisco Javier Lima Florido; Gloria Corpas Pastor |
| author_facet | Francisco Javier Lima Florido; Gloria Corpas Pastor |
| author_sort | Francisco Javier Lima Florido |
| collection | DOAJ |
| description | In recent years, the advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including tasks related to speech processing. Based on these new trends, new models such as Whisper and Wav2Vec 2.0 achieve robust performance in speech processing tasks, even in speech-to-text translation and end-to-end speech translation, far exceeding all previous results. Although these models have shown excellent results in real-time speech processing, they still have some accuracy issues for some tasks and high latency problems when working with large amounts of audio data. In addition, many of them need audio to be segmented and labelled for speech synthesis and annotation tasks. Speaker diarisation, background noise detection, prosodic boundary detection and accent classification are some of the pre-processing tasks required in these cases. In this study, we will fine-tune a small Wav2Vec 2.0 base model for multi-task classification and audio segmentation. A corpus of spoken American English will be used for the experiments. We intend to explore this new approach and, more specifically, the performance of the model with regard to prosodic boundaries detection for audio segmentation, and advanced accent identification. |
| format | Article |
| id | doaj-art-e18f527cd8a14c9983757820e0715dcb |
| institution | DOAJ |
| issn | 2073-431X |
| language | English |
| publishDate | 2025-03-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Computers |
| spelling | doaj-art-e18f527cd8a14c9983757820e0715dcb; 2025-08-20T02:42:40Z; eng; MDPI AG; Computers; ISSN 2073-431X; 2025-03-01; vol. 14, no. 3, art. 102; DOI 10.3390/computers14030102; Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models; Francisco Javier Lima Florido (0), Gloria Corpas Pastor (1); both authors: Instituto Universitario de Investigación de Tecnologías Lingüísticas Multilingües (IUITLM), University of Malaga, 29010 Malaga, Spain; https://www.mdpi.com/2073-431X/14/3/102; speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2 |
| spellingShingle | Francisco Javier Lima Florido; Gloria Corpas Pastor; Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models; Computers; speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2 |
| title | Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models |
| title_sort | advanced identification of prosodic boundaries speakers and accents through multi task audio pre processing and speech language models |
| topic | speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2 |
| url | https://www.mdpi.com/2073-431X/14/3/102 |
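
The description above mentions prosodic boundary detection as a pre-processing step for audio segmentation. This record does not give the authors' implementation, so the following is only a minimal sketch of how frame-level boundary scores could be turned into segments. The function name `segments_from_boundaries`, the 0.5 threshold, and the input scores are hypothetical; the 20 ms frame stride is the usual output rate of a Wav2Vec 2.0 encoder on 16 kHz audio and is assumed here rather than taken from the article.

```python
from typing import List, Tuple

# Assumption: one encoder frame per 20 ms of 16 kHz audio (typical for Wav2Vec 2.0).
FRAME_SEC = 0.02


def segments_from_boundaries(
    boundary_probs: List[float],
    threshold: float = 0.5,   # hypothetical decision threshold
    frame_sec: float = FRAME_SEC,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in seconds.

    A segment is closed right after every frame whose boundary
    probability reaches the threshold; trailing frames form a final segment.
    """
    segments: List[Tuple[float, float]] = []
    start = 0  # index of the first frame of the segment being built
    for i, prob in enumerate(boundary_probs):
        if prob >= threshold:
            segments.append((round(start * frame_sec, 3), round((i + 1) * frame_sec, 3)))
            start = i + 1
    # Keep any frames left over after the last detected boundary.
    if start < len(boundary_probs):
        segments.append(
            (round(start * frame_sec, 3), round(len(boundary_probs) * frame_sec, 3))
        )
    return segments


if __name__ == "__main__":
    # Made-up scores for six frames (120 ms of audio), for illustration only.
    print(segments_from_boundaries([0.1, 0.2, 0.9, 0.1, 0.8, 0.3]))
```

In a multi-task setting, the same span list could carry per-segment speaker or accent labels predicted by other classification heads; that pairing is a design assumption, not something this record states.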