Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer with long sequences, it is crucial to reduce the number of operations for the multi-head attention...
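
As context for the abstract, the sketch below is a minimal NumPy illustration, not taken from the article, of why the self-attention score matrix costs O(n^2) in the sequence length n and how a sparsity mask allows masked score entries to be skipped; the banded mask used here is an assumed example pattern, not the authors' method.

```python
import numpy as np

def attention_scores(q, k):
    """Dense attention: the score matrix is (seq_len x seq_len),
    so its size and cost grow quadratically with sequence length."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (n, n) matrix: n^2 entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

def masked_attention_scores(q, k, mask):
    """Same computation, but positions where mask is False receive zero
    weight; attention-sparsity approaches exploit this to skip work."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)   # drop masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    n, d = 8, 4                                # toy sequence length and head dim
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    # Assumed banded sparsity pattern: each token attends to a local window.
    band = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 2
    print(attention_scores(q, k).shape)        # (8, 8): n^2 scores
    print(masked_attention_scores(q, k, band).shape)
```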


Bibliographic Details
Main Authors: Bokyeong Yoon, Ah-Hyun Lee, Jinsung Kim, Gordon Euhyun Moon
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10589623/