Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models

Bibliographic Details
Main Authors:	Francisco Javier Lima Florido, Gloria Corpas Pastor
Format:	Article
Language:	English
Published:	MDPI AG 2025-03-01
Series:	Computers
Subjects:	speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2
Online Access:	https://www.mdpi.com/2073-431X/14/3/102

_version_	1850089899979440128
author	Francisco Javier Lima Florido, Gloria Corpas Pastor
author_facet	Francisco Javier Lima Florido, Gloria Corpas Pastor
author_sort	Francisco Javier Lima Florido
collection	DOAJ
description	In recent years, advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including tasks related to speech processing. Building on these trends, models such as Whisper and Wav2Vec 2.0 achieve robust performance in speech processing tasks, even in speech-to-text translation and end-to-end speech translation, far exceeding all previous results. Although these models have shown excellent results in real-time speech processing, they still have accuracy issues on some tasks and high latency when working with large amounts of audio data. In addition, many of them need audio to be segmented and labelled for speech synthesis and annotation tasks. Speaker diarisation, background noise detection, prosodic boundary detection and accent classification are some of the pre-processing tasks required in these cases. In this study, we will fine-tune a small Wav2Vec 2.0 base model for multi-task classification and audio segmentation. A corpus of spoken American English will be used for the experiments. We intend to explore this new approach and, more specifically, the performance of the model on prosodic boundary detection for audio segmentation and on advanced accent identification.
format	Article
id	doaj-art-e18f527cd8a14c9983757820e0715dcb
institution	DOAJ
issn	2073-431X
language	English
publishDate	2025-03-01
publisher	MDPI AG
record_format	Article
series	Computers
spelling	doaj-art-e18f527cd8a14c9983757820e0715dcb | 2025-08-20T02:42:40Z | eng | MDPI AG | Computers | 2073-431X | 2025-03-01 | 14 | 3 | 102 | 10.3390/computers14030102 | Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models | Francisco Javier Lima Florido (Instituto Universitario de Investigación de Tecnologías Lingüísticas Multilingües (IUITLM), University of Malaga, 29010 Malaga, Spain) | Gloria Corpas Pastor (Instituto Universitario de Investigación de Tecnologías Lingüísticas Multilingües (IUITLM), University of Malaga, 29010 Malaga, Spain) | https://www.mdpi.com/2073-431X/14/3/102 | speech processing | prosodic boundaries detection | speaker change detection | accent classification | transformer architecture | Wav2Vec2
title	Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models
topic	speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2
url	https://www.mdpi.com/2073-431X/14/3/102

Similar Items