Head information bottleneck (HIB): leveraging information bottleneck for efficient transformer head attribution and pruning
Abstract: Multi-head attention mechanisms have been widely applied in speech pre-training. However, their roles and effectiveness in various downstream tasks have not been fully explored, and the importance of individual attention heads may vary with the downstream task. We hypothesize that attention allocation within the attention mechanism behaves like an information bottleneck, highlighting the parts of the input that are important for the task. We therefore introduce the information bottleneck into multi-head attention to estimate the mutual information between each attention head's output and the input, guiding each head to focus on useful information. Additionally, we propose a method to measure the contribution of attention heads to downstream tasks and prune heads based on these contributions, offering an interpretable direction for model pruning. Our experiments compare the pruning effectiveness of our method with the traditional Taylor expansion method and the integrated gradients method; our approach significantly outperforms the former and achieves results comparable to the latter on multiple tasks.
| Main Authors: | Yukun Qian, Xuyi Zhuang, Mingjiang Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-07-01 |
| Series: | EURASIP Journal on Audio, Speech, and Music Processing |
| Subjects: | Attribution; Informational bottleneck; Multi-head attention; Explainable AI |
| ISSN: | 1687-4722 |
| Affiliation: | Key Laboratory for Key Technologies of IoT Terminals, Harbin Institute of Technology |
| Online Access: | https://doi.org/10.1186/s13636-025-00411-8 |
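The abstract describes the HIB idea only at a high level, so the sketch below is a hypothetical illustration rather than the authors' implementation: a learnable stochastic gate is attached to each attention head, a KL penalty bounds how much information each head passes downstream, and heads whose gates collapse toward zero become pruning candidates. The module name `HeadIBGate`, the gate parameterization, and the pruning rule are all assumptions introduced for illustration.

```python
# Hypothetical per-head information-bottleneck gate for multi-head attention,
# inspired by the abstract above. NOT the paper's code; names, the loss
# weighting, and the pruning rule are illustrative assumptions.
import torch
import torch.nn as nn


class HeadIBGate(nn.Module):
    """Multiplies each head's output by a stochastic gate z ~ N(mu, sigma^2).

    The KL divergence between the gate distribution and a standard normal prior
    bounds the information each head can transmit; heads whose gates collapse
    toward zero contribute little and are candidates for pruning.
    """

    def __init__(self, num_heads: int):
        super().__init__()
        self.mu = nn.Parameter(torch.ones(num_heads))                   # gate means, start open
        self.log_sigma = nn.Parameter(torch.full((num_heads,), -3.0))   # small initial noise

    def forward(self, head_outputs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        sigma = self.log_sigma.exp()
        if self.training:
            z = self.mu + sigma * torch.randn_like(self.mu)             # reparameterization trick
        else:
            z = self.mu
        gated = head_outputs * z.view(1, -1, 1, 1)
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over heads
        kl = 0.5 * (self.mu**2 + sigma**2 - 2 * self.log_sigma - 1.0).sum()
        return gated, kl

    @torch.no_grad()
    def head_importance(self) -> torch.Tensor:
        # A simple attribution score: heads with larger |mu| transmit more information.
        return self.mu.abs()


if __name__ == "__main__":
    gate = HeadIBGate(num_heads=12)
    dummy = torch.randn(2, 12, 50, 64)                                  # (batch, heads, frames, dim)
    out, kl = gate(dummy)
    # The training objective would combine task_loss + beta * kl, with beta a
    # small trade-off weight (value assumed here, not taken from the paper).
    print(out.shape, kl.item())
    # After training, prune e.g. the four least important heads.
    prune_idx = gate.head_importance().argsort()[:4]
    print("candidate heads to prune:", prune_idx.tolist())
```

In this reading, the KL term plays the role of the bottleneck constraint and the learned gate magnitudes serve as the per-head contribution scores used to rank heads for pruning.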