Head information bottleneck (HIB): leveraging information bottleneck for efficient transformer head attribution and pruning

Bibliographic Details
Main Authors: Yukun Qian, Xuyi Zhuang, Mingjiang Wang
Format: Article
Language: English
Published: SpringerOpen 2025-07-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
Subjects:
Online Access: https://doi.org/10.1186/s13636-025-00411-8
Description
Summary: Abstract Multi-head attention mechanisms have been widely applied in speech pre-training. However, their roles and effectiveness in various downstream tasks have not been fully explored, and attention heads may vary in importance depending on the downstream task. We assume that the attention allocation in the attention mechanism resembles an information bottleneck, aiming to highlight the parts that are important for the task. We introduce the information bottleneck into multi-head attention to estimate the mutual information between each attention head’s output and the input, guiding each head to focus on useful information. Additionally, we propose a method to measure the contribution of attention heads to downstream tasks, and we prune heads based on their contributions, offering an interpretable direction for model pruning. Our experiments compare the pruning effectiveness of our method with that of the traditional Taylor expansion method and the integrated gradients method; our approach significantly outperforms the former and achieves results comparable to the latter on multiple tasks.
ISSN: 1687-4722
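
The abstract describes the approach only at a high level, so the following PyTorch sketch is an illustration of the general idea rather than the paper's actual formulation: each attention head's output is scaled by a learned stochastic gate, a KL penalty on the gates plays the role of the information bottleneck, and the surviving gate magnitudes are read off as per-head contribution scores used to prune heads. The class name GatedMultiheadAttention, the Gaussian gate parameterisation, and the pruning-by-gate-magnitude step are all assumptions introduced for this sketch.

import torch
import torch.nn as nn


class GatedMultiheadAttention(nn.Module):
    """Self-attention in which each head's output passes through a learned
    stochastic gate; a KL penalty on the gates acts as the bottleneck term."""

    def __init__(self, embed_dim: int, num_heads: int, kl_weight: float = 1e-3):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # One Gaussian gate per head (mean and log-variance of multiplicative noise).
        self.gate_mu = nn.Parameter(torch.ones(num_heads))
        self.gate_logvar = nn.Parameter(torch.full((num_heads,), -5.0))
        self.kl_weight = kl_weight

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, embed_dim)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v  # per-head outputs: (batch, heads, seq_len, head_dim)

        # Reparameterised multiplicative gate per head (stochastic only in training).
        std = torch.exp(0.5 * self.gate_logvar)
        noise = torch.randn_like(self.gate_mu) if self.training else torch.zeros_like(self.gate_mu)
        gate = self.gate_mu + std * noise
        heads = heads * gate.view(1, self.num_heads, 1, 1)

        # KL of the gate distribution against a standard-normal prior: shrinking a
        # gate toward zero limits how much information that head can pass on.
        kl = 0.5 * (self.gate_mu ** 2 + std ** 2 - self.gate_logvar - 1.0).sum()

        out = self.out_proj(heads.transpose(1, 2).reshape(b, t, d))
        return out, self.kl_weight * kl

    def head_scores(self) -> torch.Tensor:
        # Gates that survive the penalty mark heads with higher estimated contribution.
        return self.gate_mu.abs().detach()


if __name__ == "__main__":
    layer = GatedMultiheadAttention(embed_dim=256, num_heads=8)
    x = torch.randn(2, 50, 256)
    y, kl_penalty = layer(x)  # add kl_penalty to the downstream task loss
    print(y.shape, float(kl_penalty))

    # Prune the two lowest-scoring heads by zeroing their gates.
    prune_idx = layer.head_scores().argsort()[:2]
    with torch.no_grad():
        layer.gate_mu[prune_idx] = 0.0
        layer.gate_logvar[prune_idx] = -20.0
    print("pruned heads:", prune_idx.tolist())

In this sketch the KL weight trades task loss against compression: a larger kl_weight drives more gates toward zero, which both focuses the remaining heads on task-relevant information and yields an ordering of heads for pruning.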