Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
The computational complexity of training a Transformer model grows quadratically with the length of the input sequence. To accelerate the training of a large-scale Transformer on long sequences, it is therefore crucial to reduce the number of operations in the multi-head attention computations, which dominate the overall training process. Previous approaches sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, because the critical elements of the attention score matrix vary across tasks and datasets, selecting them dynamically is essential for better model quality. In this paper, we propose a new sparsity-aware Transformer that captures the task- and input-dependent sparsity pattern of the attention score matrix during a small number of steps of standard Transformer training. The identified sparsity pattern is then used in sparse training, to which the model transitions from standard training based on the skewness and distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of operations in the multi-head attention, achieving up to a 2.84× training speedup, a 6.87× memory reduction, and better accuracy compared with state-of-the-art sparse Transformer models.
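The record contains no code, so the sketch below is only an illustration of the general idea in the abstract: capture a task- and input-dependent sparsity mask from the attention score matrix during a few dense warm-up steps, then reuse that mask to prune later attention computations. Everything here is an assumption for illustration purposes, including the names (`dense_attention_scores`, `capture_sparsity_mask`, `sparse_attention`, `TOP_K`) and the top-k selection rule; the paper's actual selection criterion, based on skewness and distance statistics, is not detailed in this record.

```python
# Illustrative sketch only -- NOT the authors' implementation.
# Shows one way a task- and input-dependent attention sparsity mask could be
# captured during a few dense warm-up steps and reused to skip computation later.
import torch
import torch.nn.functional as F

def dense_attention_scores(q, k):
    """Standard scaled dot-product attention scores (pre-softmax)."""
    d = q.size(-1)
    return (q @ k.transpose(-2, -1)) / d ** 0.5

def capture_sparsity_mask(scores, top_k):
    """Keep only the top-k scores per query row; mark the rest as prunable."""
    # scores: (batch, heads, seq_len, seq_len)
    idx = scores.topk(top_k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

def sparse_attention(q, k, v, mask):
    """Attention that ignores positions pruned by the captured mask."""
    scores = dense_attention_scores(q, k)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: warm-up runs dense attention and records the mask; later steps
# reuse it. A real system would also decide *when* to switch, e.g. from
# statistics of the attention score distribution.
batch, heads, seq_len, d_head, TOP_K = 2, 4, 128, 64, 16
q = torch.randn(batch, heads, seq_len, d_head)
k = torch.randn(batch, heads, seq_len, d_head)
v = torch.randn(batch, heads, seq_len, d_head)

warmup_scores = dense_attention_scores(q, k)
mask = capture_sparsity_mask(warmup_scores, TOP_K)
out = sparse_attention(q, k, v, mask)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```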
| Main Authors: | Bokyeong Yoon, Ah-Hyun Lee, Jinsung Kim, Gordon Euhyun Moon |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | Sparse Transformer; sparse attention; MHA optimization |
| Online Access: | https://ieeexplore.ieee.org/document/10589623/ |
| _version_ | 1850263517485072384 |
|---|---|
| author | Bokyeong Yoon; Ah-Hyun Lee; Jinsung Kim; Gordon Euhyun Moon |
| author_sort | Bokyeong Yoon |
| collection | DOAJ |
| description | The computational complexity of training a Transformer model grows quadratically with the length of the input sequence. To accelerate the training of a large-scale Transformer on long sequences, it is therefore crucial to reduce the number of operations in the multi-head attention computations, which dominate the overall training process. Previous approaches sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, because the critical elements of the attention score matrix vary across tasks and datasets, selecting them dynamically is essential for better model quality. In this paper, we propose a new sparsity-aware Transformer that captures the task- and input-dependent sparsity pattern of the attention score matrix during a small number of steps of standard Transformer training. The identified sparsity pattern is then used in sparse training, to which the model transitions from standard training based on the skewness and distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of operations in the multi-head attention, achieving up to a 2.84× training speedup, a 6.87× memory reduction, and better accuracy compared with state-of-the-art sparse Transformer models. |
| format | Article |
| id | doaj-art-36d2203371f44d2d892e6e1f40d57a23 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-36d2203371f44d2d892e6e1f40d57a23; 2025-08-20T01:54:57Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 131373-131384; DOI 10.1109/ACCESS.2024.3425638; article no. 10589623; Exploring Attention Sparsity to Accelerate Transformer Training on GPUs; Bokyeong Yoon (https://orcid.org/0009-0006-0175-0753), Ah-Hyun Lee, Jinsung Kim (https://orcid.org/0000-0003-3751-8869), Gordon Euhyun Moon (https://orcid.org/0000-0003-4992-6181); Yoon, Lee, and Moon: Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea; Kim: School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea; keywords: Sparse Transformer, sparse attention, MHA optimization; https://ieeexplore.ieee.org/document/10589623/ |
| title | Exploring Attention Sparsity to Accelerate Transformer Training on GPUs |
| topic | Sparse Transformer; sparse attention; MHA optimization |
| url | https://ieeexplore.ieee.org/document/10589623/ |
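The description above also states that the transition from standard to sparse training is driven by the skewness and distance values of the attention score matrices, but the record gives no implementation details. The sketch below is purely hypothetical: it shows one way a skewness statistic over softmax-normalized attention scores could gate such a switch. The function names and the threshold value are assumptions, not the paper's method.

```python
# Hypothetical sketch, not from the paper: a skewness statistic over
# attention scores used to decide when dense training could hand off
# to sparse training. The threshold is illustrative only.
import torch

def attention_score_skewness(scores):
    """Average Fisher skewness of the softmax-normalized scores in each row.

    Highly skewed rows concentrate probability mass on a few keys, which
    suggests the remaining positions could be pruned with little quality loss.
    scores: (batch, heads, seq_len, seq_len), pre-softmax.
    """
    probs = torch.softmax(scores, dim=-1)
    mean = probs.mean(dim=-1, keepdim=True)
    std = probs.std(dim=-1, keepdim=True).clamp_min(1e-12)
    skew = (((probs - mean) / std) ** 3).mean(dim=-1)
    return skew.mean()  # average over batch, heads, and rows

def should_switch_to_sparse(scores, threshold=2.0):
    """Switch once the average row skewness exceeds an (illustrative) threshold."""
    return attention_score_skewness(scores).item() > threshold

# Toy usage with random scores: print the statistic and the decision.
scores = torch.randn(2, 4, 128, 128)
print(attention_score_skewness(scores).item(), should_switch_to_sparse(scores))
```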