Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models

In recent years, the advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including tasks related to speech processing. Based on these new trends, new models such as Whisper and Wa...

Bibliographic Details
Main Authors: Francisco Javier Lima Florido, Gloria Corpas Pastor
Format: Article
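Language: English
Published: MDPI AG 2025-03-01
Series: Computers
Subjects: speech processing; prosodic boundaries detection; speaker change detection; accent classification; transformer architecture; Wav2Vec2
Online Access: https://www.mdpi.com/2073-431X/14/3/102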
author Francisco Javier Lima Florido
Gloria Corpas Pastor
collection DOAJ
description In recent years, advances in deep neural networks (DNNs) and large language models (LLMs) have led to major breakthroughs and new levels of performance in Natural Language Processing (NLP), including speech processing tasks. Building on these trends, models such as Whisper and Wav2Vec 2.0 achieve robust performance in speech processing tasks, including speech-to-text translation and end-to-end speech translation, far exceeding previous results. Although these models have shown excellent results in real-time speech processing, they still have accuracy issues on some tasks and suffer from high latency when working with large amounts of audio data. In addition, many of them need audio to be segmented and labelled for speech synthesis and annotation tasks. Speaker diarisation, background noise detection, prosodic boundary detection and accent classification are some of the pre-processing tasks required in these cases. In this study, we will fine-tune a small Wav2Vec 2.0 base model for multi-task classification and audio segmentation. A corpus of spoken American English will be used for the experiments. We intend to explore this new approach and, more specifically, the performance of the model with regard to prosodic boundary detection for audio segmentation and advanced accent identification.
format Article
id doaj-art-e18f527cd8a14c9983757820e0715dcb
institution DOAJ
issn 2073-431X
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Computers
spelling doaj-art-e18f527cd8a14c9983757820e0715dcb
2025-08-20T02:42:40Z
eng
MDPI AG
Computers
2073-431X
2025-03-01
Volume 14, Issue 3, Article 102
DOI: 10.3390/computers14030102
Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models
Francisco Javier Lima Florido (Instituto Universitario de Investigación de Tecnologías Lingüísticas Multilingües (IUITLM), University of Malaga, 29010 Malaga, Spain)
Gloria Corpas Pastor (Instituto Universitario de Investigación de Tecnologías Lingüísticas Multilingües (IUITLM), University of Malaga, 29010 Malaga, Spain)
https://www.mdpi.com/2073-431X/14/3/102
speech processing
prosodic boundaries detection
speaker change detection
accent classification
transformer architecture
Wav2Vec2
title Advanced Identification of Prosodic Boundaries, Speakers, and Accents Through Multi-Task Audio Pre-Processing and Speech Language Models
topic speech processing
prosodic boundaries detection
speaker change detection
accent classification
transformer architecture
Wav2Vec2
url https://www.mdpi.com/2073-431X/14/3/102