A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow

General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, to map the compressed GEMM of various sizes onto the accelerator, an adaptable mapping scheme is designed. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers with the compressed format; denser vectors can exceed the buffer capacity, while sparser vectors may lead to the underutilization of buffers. To address this complication, this work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments with an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.
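The offline sparsity-aware shuffle described in the abstract can be read as a load-balancing problem: compressed weight vectors have irregular nonzero counts, so they are reassigned across the distributed buffers such that dense vectors do not overflow any one buffer and sparse vectors do not leave others underused. The Python sketch below is only a minimal illustration of that idea; the function name sparsity_aware_shuffle, the greedy longest-first heuristic, and the capacity check are assumptions for illustration, not the algorithm the authors actually implement.

```python
# Minimal illustration (assumed, not the paper's published algorithm):
# redistribute compressed weight columns across distributed local buffers so
# that nonzero counts are balanced and no single buffer overflows.
from heapq import heapify, heappush, heappop

def sparsity_aware_shuffle(nnz_per_col, num_buffers, buffer_capacity):
    """Greedy longest-first assignment: place the densest remaining column
    into the currently least-loaded buffer. Returns one column-index list
    per buffer; raises if any buffer would exceed its capacity."""
    order = sorted(range(len(nnz_per_col)), key=lambda c: -nnz_per_col[c])
    heap = [(0, b, []) for b in range(num_buffers)]  # (load, buffer id, columns)
    heapify(heap)
    for col in order:
        load, b, cols = heappop(heap)
        new_load = load + nnz_per_col[col]
        if new_load > buffer_capacity:
            raise ValueError(f"column {col} overflows buffer {b} ({new_load} > {buffer_capacity})")
        cols.append(col)
        heappush(heap, (new_load, b, cols))
    return [cols for _, _, cols in sorted(heap, key=lambda entry: entry[1])]

# Example: 8 weight columns with irregular sparsity mapped onto 4 local buffers.
print(sparsity_aware_shuffle([30, 5, 22, 7, 18, 9, 3, 26], num_buffers=4, buffer_capacity=40))
```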

Bibliographic Details
Main Authors: Peng Liu, Yu Wang
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Micromachines
Subjects: GEMM accelerator, dataflow, sparsity
Online Access: https://www.mdpi.com/2072-666X/16/1/101
_version_ 1832587916297109504
author Peng Liu
Yu Wang
author_facet Peng Liu
Yu Wang
author_sort Peng Liu
collection DOAJ
description General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, to map the compressed GEMM of various sizes onto the accelerator, an adaptable mapping scheme is designed. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers with the compressed format; denser vectors can exceed the buffer capacity, while sparser vectors may lead to the underutilization of buffers. To address this complication, this work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments with an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.
format Article
id doaj-art-4968c03b7bd54f2ba2582d48f3295c27
institution Kabale University
issn 2072-666X
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Micromachines
spelling doaj-art-4968c03b7bd54f2ba2582d48f3295c27 (2025-01-24T13:42:10Z); eng; MDPI AG; Micromachines; 2072-666X; 2025-01-01; vol. 16, no. 1, art. 101; 10.3390/mi16010101; A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow; Peng Liu (Research Center for Novel Computing Sensing and Intelligent Processing, Zhejiang Lab, Hangzhou 311100, China); Yu Wang (Research Center for Novel Computing Sensing and Intelligent Processing, Zhejiang Lab, Hangzhou 311100, China); General matrix multiplication (GEMM) in machine learning involves massive computation and data movement, which restricts its deployment on resource-constrained devices. Although data reuse can reduce data movement during GEMM processing, current approaches fail to fully exploit its potential. This work introduces a sparse GEMM accelerator with a weight-and-output stationary (WOS) dataflow and a distributed buffer architecture. It processes GEMM in a compressed format and eliminates on-chip transfers of both weights and partial sums. Furthermore, to map the compressed GEMM of various sizes onto the accelerator, an adaptable mapping scheme is designed. However, the irregular sparsity of weight matrices makes it difficult to store them in local buffers with the compressed format; denser vectors can exceed the buffer capacity, while sparser vectors may lead to the underutilization of buffers. To address this complication, this work also proposes an offline sparsity-aware shuffle strategy for weights, which balances the utilization of distributed buffers and minimizes buffer waste. Finally, a low-cost sparse computing method is applied to the WOS dataflow with globally shared inputs to achieve high computing throughput. Experiments with an FPGA show that the proposed accelerator achieves 1.73× better computing efficiency and 1.36× higher energy efficiency than existing approaches.; https://www.mdpi.com/2072-666X/16/1/101; GEMM accelerator; dataflow; sparsity
spellingShingle Peng Liu
Yu Wang
A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
Micromachines
GEMM accelerator
dataflow
sparsity
title A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
title_full A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
title_fullStr A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
title_full_unstemmed A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
title_short A Low-Power General Matrix Multiplication Accelerator with Sparse Weight-and-Output Stationary Dataflow
title_sort low power general matrix multiplication accelerator with sparse weight and output stationary dataflow
topic GEMM accelerator
dataflow
sparsity
url https://www.mdpi.com/2072-666X/16/1/101
work_keys_str_mv AT pengliu alowpowergeneralmatrixmultiplicationacceleratorwithsparseweightandoutputstationarydataflow
AT yuwang alowpowergeneralmatrixmultiplicationacceleratorwithsparseweightandoutputstationarydataflow
AT pengliu lowpowergeneralmatrixmultiplicationacceleratorwithsparseweightandoutputstationarydataflow
AT yuwang lowpowergeneralmatrixmultiplicationacceleratorwithsparseweightandoutputstationarydataflow