A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, an adaptable mapping scheme is designed to map compressed GEMMs of various sizes onto the accelerator. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers in the compressed format: denser vectors can exceed the buffer capacity, while sparser vectors leave buffers underutilized. To address this, the work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of the distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments on an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.
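The record does not include the authors' implementation, but the dataflow described in the abstract can be pictured with a short sketch. The minimal Python below is an illustrative assumption, not the paper's algorithm: all names, buffer sizes, and the greedy balancing rule are invented for clarity. It shows weight columns compressed to index/value pairs, an offline sparsity-aware shuffle that balances nonzero counts across distributed buffers, and a weight-and-output stationary compute loop in which only the inputs are shared globally.

```python
# Conceptual sketch only (not the authors' code). Buffer count, capacity, and the
# greedy balancing rule are assumptions made for illustration.
import numpy as np

def compress_columns(W):
    """Compress each weight column to (col, row_indices, values), dropping zeros."""
    cols = []
    for j in range(W.shape[1]):
        rows = np.nonzero(W[:, j])[0]
        cols.append((j, rows, W[rows, j]))
    return cols

def sparsity_aware_shuffle(cols, num_buffers, capacity):
    """Offline greedy shuffle: place denser columns first, always into the
    least-loaded buffer, so nonzeros stay balanced and no buffer overflows."""
    buffers = [[] for _ in range(num_buffers)]
    load = [0] * num_buffers
    for col in sorted(cols, key=lambda c: len(c[2]), reverse=True):
        b = min(range(num_buffers), key=lambda i: load[i])
        if load[b] + len(col[2]) > capacity:
            raise ValueError("column does not fit; larger buffers or re-tiling needed")
        buffers[b].append(col)
        load[b] += len(col[2])
    return buffers

def wos_spmm(A, buffers, num_cols):
    """WOS dataflow: compressed weights and partial sums stay local to each
    buffer/PE; input rows of A are broadcast (globally shared) to all PEs."""
    M = A.shape[0]
    C = np.zeros((M, num_cols))
    for m in range(M):               # broadcast one input row at a time
        a = A[m, :]
        for pe_cols in buffers:      # each PE works only on its resident columns
            for j, rows, vals in pe_cols:
                # multiply only the stored nonzeros; accumulate locally
                C[m, j] += np.dot(a[rows], vals)
    return C

# Usage: a small ~80%-sparse weight matrix mapped onto 4 distributed buffers.
rng = np.random.default_rng(0)
W = rng.random((64, 32)) * (rng.random((64, 32)) < 0.2)
A = rng.random((8, 64))
buffers = sparsity_aware_shuffle(compress_columns(W), num_buffers=4, capacity=200)
assert np.allclose(wos_spmm(A, buffers, W.shape[1]), A @ W)
```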
Main Authors: | Peng Liu, Yu Wang |
---|---|
Affiliation: | Research Center for Novel Computing Sensing and Intelligent Processing, Zhejiang Lab, Hangzhou 311100, China |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Micromachines |
ISSN: | 2072-666X |
DOI: | 10.3390/mi16010101 |
Collection: | DOAJ |
Subjects: | GEMM accelerator; dataflow; sparsity |
Online Access: | https://www.mdpi.com/2072-666X/16/1/101 |