A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, an adaptable mapping scheme is designed to map compressed GEMMs of various sizes onto the accelerator. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers in the compressed format: denser vectors can exceed the buffer capacity, while sparser vectors leave buffers underutilized. To address this, the work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of the distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments on an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.
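The record does not include the authors' implementation, but the dataflow described in the abstract can be pictured with a short sketch. The minimal Python below is an illustrative assumption, not the paper's algorithm: all names, buffer sizes, and the greedy balancing rule are invented for clarity. It shows weight columns compressed to index/value pairs, an offline sparsity-aware shuffle that balances nonzero counts across distributed buffers, and a weight-and-output stationary compute loop in which only the inputs are shared globally.

```python
# Conceptual sketch only (not the authors' code). Buffer count, capacity, and the
# greedy balancing rule are assumptions made for illustration.
import numpy as np

def compress_columns(W):
    """Compress each weight column to (col, row_indices, values), dropping zeros."""
    cols = []
    for j in range(W.shape[1]):
        rows = np.nonzero(W[:, j])[0]
        cols.append((j, rows, W[rows, j]))
    return cols

def sparsity_aware_shuffle(cols, num_buffers, capacity):
    """Offline greedy shuffle: place denser columns first, always into the
    least-loaded buffer, so nonzeros stay balanced and no buffer overflows."""
    buffers = [[] for _ in range(num_buffers)]
    load = [0] * num_buffers
    for col in sorted(cols, key=lambda c: len(c[2]), reverse=True):
        b = min(range(num_buffers), key=lambda i: load[i])
        if load[b] + len(col[2]) > capacity:
            raise ValueError("column does not fit; larger buffers or re-tiling needed")
        buffers[b].append(col)
        load[b] += len(col[2])
    return buffers

def wos_spmm(A, buffers, num_cols):
    """WOS dataflow: compressed weights and partial sums stay local to each
    buffer/PE; input rows of A are broadcast (globally shared) to all PEs."""
    M = A.shape[0]
    C = np.zeros((M, num_cols))
    for m in range(M):               # broadcast one input row at a time
        a = A[m, :]
        for pe_cols in buffers:      # each PE works only on its resident columns
            for j, rows, vals in pe_cols:
                # multiply only the stored nonzeros; accumulate locally
                C[m, j] += np.dot(a[rows], vals)
    return C

# Usage: a small ~80%-sparse weight matrix mapped onto 4 distributed buffers.
rng = np.random.default_rng(0)
W = rng.random((64, 32)) * (rng.random((64, 32)) < 0.2)
A = rng.random((8, 64))
buffers = sparsity_aware_shuffle(compress_columns(W), num_buffers=4, capacity=200)
assert np.allclose(wos_spmm(A, buffers, W.shape[1]), A @ W)
```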
Main Authors: | Peng Liu, Yu Wang |
---|---|
Affiliation: | Research Center for Novel Computing Sensing and Intelligent Processing, Zhejiang Lab, Hangzhou 311100, China |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Micromachines |
ISSN: | 2072-666X |
DOI: | 10.3390/mi16010101 |
Collection: | DOAJ |
Subjects: | GEMM accelerator; dataflow; sparsity |
Online Access: | https://www.mdpi.com/2072-666X/16/1/101 |