Coded speech enhancement using auxiliary utterance-level information
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-07-01 |
| Series: | EURASIP Journal on Audio, Speech, and Music Processing |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s13636-025-00420-7 |
| Summary: | Abstract Numerous post-processing methods have been proposed to improve coded speech quality and intelligibility. However, achieving state-of-the-art enhancement and generalisation across varying distortion levels remains a challenge. To bridge this gap, we propose a Lightweight Causal-Transformer-based Coded Speech Enhancement (LCT-CSE) model employing a causal frequency-time-frequency (FTF) transformer block. This block performs temporal and spectral sequential modelling with transformers, efficiently exploiting global dependencies across causal-context TF bins while minimising computational overhead. Experimental results indicate that the proposed LCT-CSE model outperforms the considered baselines across mainstream lossy audio codecs, including Opus, AMR-WB, EVS and LC3plus, with a smaller footprint and lower complexity. To further exploit auxiliary, utterance-level information such as bitrate and other general distortion characteristics, we propose two information incorporation methods built on the LCT-CSE model. One fuses one-hot vector representations with intermediate features, referred to as 1-hot vector-based modulation; the other dynamically switches information-dependent network paths, termed dynamic linear modulation (DLM). Both methods improve performance by utilising bitrate information, with negligible additional computational overhead. The DLM model even achieves performance comparable to bitrate-specific trained (BST) models. We further extend DLM to a more general scenario, tandem coding. Compared to two approaches used in practice, the DLM-based LCT-CSE model consistently exhibits improved generalisability across varying tandem encoding conditions by drawing on derived distortion information. Specifically, it achieves gains of up to 0.74 in PESQ, 7% in STOI, and 0.18 in MOS-SIG under various bitrate conditions. This indicates significant potential for further applications where auxiliary information can be utilised. |
| ISSN: | 1687-4722 |
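The dynamic linear modulation (DLM) idea described in the abstract, conditioning intermediate features on an auxiliary one-hot vector such as a bitrate class, can be sketched roughly as follows. This is a minimal illustrative example, not the paper's actual implementation: the class name, shapes, and randomly initialised parameters are all assumptions, and in a trained model the per-condition scale and shift would be learned.

```python
import numpy as np

class DynamicLinearModulation:
    """Illustrative DLM sketch: an auxiliary one-hot condition (e.g. a
    bitrate class) selects a condition-specific affine transform that is
    applied to intermediate features."""

    def __init__(self, num_conditions: int, num_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One (scale, shift) pair per auxiliary condition; stand-ins for
        # learned parameters in a real model.
        self.scale = rng.normal(1.0, 0.1, size=(num_conditions, num_features))
        self.shift = rng.normal(0.0, 0.1, size=(num_conditions, num_features))

    def __call__(self, features: np.ndarray, one_hot: np.ndarray) -> np.ndarray:
        # one_hot has shape (num_conditions,) and picks out one row of
        # each parameter table, i.e. the condition-dependent "path".
        gamma = one_hot @ self.scale   # (num_features,)
        beta = one_hot @ self.shift    # (num_features,)
        return features * gamma + beta # broadcast over time frames

# Usage: modulate dummy (frames, features) activations by one of four
# hypothetical bitrate classes.
dlm = DynamicLinearModulation(num_conditions=4, num_features=8)
x = np.ones((5, 8))        # dummy time-frequency features
cond = np.eye(4)[2]        # one-hot vector selecting the third class
y = dlm(x, cond)
```

Because the modulation is a single per-feature scale and shift, the extra cost over the base model is negligible, which matches the abstract's claim of minimal added overhead.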