Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer with long sequences, it is crucial to reduce the number of operations for the multi-head attention...
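
As context for the abstract, the sketch below is a minimal NumPy illustration, not taken from the article, of why the self-attention score matrix costs O(n^2) in the sequence length n and how a sparsity mask allows masked score entries to be skipped; the banded mask used here is an assumed example pattern, not the authors' method.

```python
import numpy as np

def attention_scores(q, k):
    """Dense attention: the score matrix is (seq_len x seq_len),
    so its size and cost grow quadratically with sequence length."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (n, n) matrix: n^2 entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

def masked_attention_scores(q, k, mask):
    """Same computation, but positions where mask is False receive zero
    weight; attention-sparsity approaches exploit this to skip work."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)   # drop masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    n, d = 8, 4                                # toy sequence length and head dim
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    # Assumed banded sparsity pattern: each token attends to a local window.
    band = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 2
    print(attention_scores(q, k).shape)        # (8, 8): n^2 scores
    print(masked_attention_scores(q, k, band).shape)
```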


Bibliographic Details
Main Authors: Bokyeong Yoon, Ah-Hyun Lee, Jinsung Kim, Gordon Euhyun Moon
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10589623/