Coded speech enhancement using auxiliary utterance-level information

Bibliographic Details
Main Authors: Haixin Zhao, Nilesh Madhu
Format: Article
Language: English
Published: SpringerOpen 2025-07-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
Subjects:
Online Access: https://doi.org/10.1186/s13636-025-00420-7
Description
Summary: Numerous post-processing methods have been proposed to improve coded speech quality and intelligibility. However, achieving state-of-the-art enhancement and generalisation across varying distortion levels remains a challenge. To bridge this gap, we propose a Lightweight Causal-Transformer-based Coded Speech Enhancement (LCT-CSE) model employing a causal frequency-time-frequency (FTF) transformer block. This block facilitates temporal and spectral sequential modelling using transformers, efficiently exploiting global dependencies across causal-context TF bins while minimising computational overhead. Experimental results indicate that the proposed LCT-CSE model outperforms the considered baselines across mainstream lossy audio codecs, including Opus, AMR-WB, EVS and LC3+, with a smaller footprint and lower complexity.
To further exploit auxiliary, utterance-level information such as bitrate and other general distortion characteristics, we propose two information-incorporation methods built upon the LCT-CSE model. The first employs one-hot vector representations and feature fusion, referred to as 1-hot vector-based modulation; the second dynamically switches information-dependent network paths, termed dynamic linear modulation (DLM). Both methods improve performance by exploiting bitrate information, with negligible additional computational overhead. The DLM model even achieves performance comparable to bitrate-specific trained (BST) models.
We further extend the proposed information-incorporation method, DLM, to a generalised scenario: tandem coding. Compared to the two approaches used in practice, the DLM-based LCT-CSE model consistently exhibits improved generalisability across varying tandem-encoding conditions, based on derived distortion information. Specifically, it achieves gains of up to 0.74 in PESQ, 7% in STOI, and 0.18 in MOS-SIG under various bitrate conditions.
This indicates significant potential for further applications where auxiliary information can be utilised.
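The dynamic linear modulation idea described in the summary can be sketched as conditioning intermediate features on an utterance-level label. Below is a minimal, hypothetical illustration: a one-hot bitrate vector selects a per-channel scale and shift pair that modulates the features. All names, shapes, and the bitrate set are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Hypothetical set of codec bitrates (bit/s) and feature channel count;
# in the paper these would come from the codec and network architecture.
BITRATES = [6000, 9600, 13200, 16400]
N_CHANNELS = 8

rng = np.random.default_rng(0)
# One (scale, shift) pair per bitrate condition. Here they are random
# placeholders; in a trained model they would be learned parameters.
gammas = rng.standard_normal((len(BITRATES), N_CHANNELS))
betas = rng.standard_normal((len(BITRATES), N_CHANNELS))

def one_hot(bitrate):
    """1-hot vector representation of the utterance-level bitrate."""
    v = np.zeros(len(BITRATES))
    v[BITRATES.index(bitrate)] = 1.0
    return v

def dlm(features, bitrate):
    """Linearly modulate per-channel features: y = gamma * x + beta,
    with (gamma, beta) selected by the bitrate condition."""
    h = one_hot(bitrate)
    gamma = h @ gammas  # picks the row matching this bitrate
    beta = h @ betas
    return gamma * features + beta

x = rng.standard_normal((5, N_CHANNELS))  # (time frames, channels)
y = dlm(x, 9600)
```

Because the conditioning reduces to one elementwise multiply and add per feature map, this kind of modulation adds negligible computational overhead, consistent with the claim in the summary.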
ISSN:1687-4722