Exploring Attention Sparsity to Accelerate Transformer Training on GPUs
The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer with long sequences, it is crucial to reduce the number of operations for the multi-head attention...
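To make the quadratic-cost claim concrete, the following minimal NumPy sketch (not the paper's method; the local-window pattern and window width `w` are illustrative assumptions) shows how restricting each query to a fixed window of keys reduces the attention-score computation from O(n²·d) to O(n·w·d).

```python
# Minimal sketch: local-window sparse attention vs. dense attention cost.
# Dense attention scores all n*n query-key pairs; with a +/- w window,
# each query scores at most 2w+1 keys. Pattern and width are illustrative.
import numpy as np

def local_window_attention(Q, K, V, w=32):
    """Q, K, V: (n, d) arrays. Each query i attends only to keys in [i-w, i+w]."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # (hi-lo,) scores instead of (n,)
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 512, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    y = local_window_attention(Q, K, V, w=32)
    print(y.shape)  # (512, 64); dense attention would score 512*512 pairs
```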
Saved in:
| Main Authors: | Bokyeong Yoon, Ah-Hyun Lee, Jinsung Kim, Gordon Euhyun Moon |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/10589623/ |
Similar Items
- Residual-conditioned sparse transformer for photoacoustic image artifact reduction
  by: Xiaoxue Wang, et al.
  Published: (2025-08-01)
- Dual Transformers With Latent Amplification for Multivariate Time Series Anomaly Detection
  by: Yeji Choi, et al.
  Published: (2025-01-01)
- A spiking photonic neural network of 40 000 neurons, trained with latency and rank-order coding for leveraging sparsity
  by: Ria Talukder, et al.
  Published: (2025-01-01)
- Exploring non-zero position constraints: algorithm-hardware co-designed DNN sparse training method
  by: WANG Miao, et al.
  Published: (2025-02-01)
- SFDformer: a frequency-based sparse decomposition transformer for air pollution time series prediction
  by: Zhenkai Qin, et al.
  Published: (2025-03-01)