Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

Bibliographic Details
Main Authors: Bokyeong Yoon, Ah-Hyun Lee, Jinsung Kim, Gordon Euhyun Moon
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Sparse Transformer; sparse attention; MHA optimization
Online Access: https://ieeexplore.ieee.org/document/10589623/
_version_ 1850263517485072384
author Bokyeong Yoon
Ah-Hyun Lee
Jinsung Kim
Gordon Euhyun Moon
author_facet Bokyeong Yoon
Ah-Hyun Lee
Jinsung Kim
Gordon Euhyun Moon
author_sort Bokyeong Yoon
collection DOAJ
description The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations required for the multi-head attention computations, which dominate the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, since the critical elements of the attention score matrix vary across tasks and datasets, considering them dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures task- and input-dependent sparsity patterns in the attention score matrix during a small number of steps of standard Transformer training. The identified sparsity pattern is then used in sparse training, to which the model transfers from standard training based on the degree of skewness and the distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of operations in multi-head attention, achieving up to $2.84\times$ training speedup, $6.87\times$ memory reduction, and better accuracy compared to state-of-the-art sparse Transformer models.
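To make the idea in the description concrete, here is a minimal, hypothetical PyTorch sketch of how a task- and input-dependent sparsity pattern might be captured from a short warm-up phase of dense training and then applied as a mask during sparse attention. The function names (capture_sparsity_pattern, row_skewness, sparse_attention), the top-k selection rule, and the keep_ratio parameter are illustrative assumptions, not the authors' implementation; the paper also uses distance values between attention score matrices to decide when to transfer to sparse training, which this sketch omits.

```python
# Hypothetical sketch (not the authors' implementation): after a few steps of
# standard dense training, keep only the top-k entries per row of the attention
# score matrix, and use the skewness of each head's score distribution as a
# rough signal that most of the attention mass sits in a few critical elements.
import torch

def capture_sparsity_pattern(attn_scores: torch.Tensor, keep_ratio: float = 0.125):
    """attn_scores: (heads, seq_len, seq_len) post-softmax scores from a
    warm-up step of dense training. Returns a boolean mask marking the
    critical elements to keep for sparse training."""
    heads, n, _ = attn_scores.shape
    k = max(1, int(keep_ratio * n))
    # Indices of the k largest scores in each row (the "critical" elements).
    topk_idx = attn_scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(attn_scores, dtype=torch.bool)
    mask.scatter_(-1, topk_idx, True)
    return mask

def row_skewness(attn_scores: torch.Tensor) -> torch.Tensor:
    """Fisher skewness of each head's score distribution; a highly skewed
    distribution suggests masking the small scores should cost little accuracy."""
    flat = attn_scores.flatten(start_dim=1)          # (heads, n*n)
    mean = flat.mean(dim=1, keepdim=True)
    std = flat.std(dim=1, keepdim=True)
    return (((flat - mean) / std) ** 3).mean(dim=1)  # (heads,)

def sparse_attention(q, k, v, mask, scale):
    """Masked attention: scores outside the captured pattern are set to -inf
    before softmax, so only the critical elements contribute."""
    scores = (q @ k.transpose(-2, -1)) * scale
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

In such a setup, the captured mask would be reused for the remaining training steps, so the dense score matrix is only materialized during the short warm-up; how the actual GPU kernels exploit the pattern is specific to the paper and not shown here.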
format Article
id doaj-art-36d2203371f44d2d892e6e1f40d57a23
institution OA Journals
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-36d2203371f44d2d892e6e1f40d57a23 2025-08-20T01:54:57Z eng IEEE, IEEE Access, ISSN 2169-3536, 2024-01-01, vol. 12, pp. 131373-131384, doi:10.1109/ACCESS.2024.3425638, IEEE document 10589623
Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
Bokyeong Yoon (https://orcid.org/0009-0006-0175-0753), Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
Ah-Hyun Lee, Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
Jinsung Kim (https://orcid.org/0000-0003-3751-8869), School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea
Gordon Euhyun Moon (https://orcid.org/0000-0003-4992-6181), Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
https://ieeexplore.ieee.org/document/10589623/
Sparse Transformer; sparse attention; MHA optimization
spellingShingle Bokyeong Yoon
Ah-Hyun Lee
Jinsung Kim
Gordon Euhyun Moon
Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
IEEE Access
Sparse Transformer
sparse attention
MHA optimization
title Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
title_full Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
title_fullStr Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
title_full_unstemmed Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
title_short Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
title_sort exploring attention sparsity to accelerate transformer training on gpus
topic Sparse Transformer
sparse attention
MHA optimization
url https://ieeexplore.ieee.org/document/10589623/
work_keys_str_mv AT bokyeongyoon exploringattentionsparsitytoacceleratetransformertrainingongpus
AT ahhyunlee exploringattentionsparsitytoacceleratetransformertrainingongpus
AT jinsungkim exploringattentionsparsitytoacceleratetransformertrainingongpus
AT gordoneuhyunmoon exploringattentionsparsitytoacceleratetransformertrainingongpus