Cross-Filter Structured Pruning for Efficient Sparse CNN Acceleration


Bibliographic Details
Main Authors: Ngoc-Son Pham, Sangwon Shin, Lei Xu, Weidong Shi, Taeweon Suh
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11072696/
Description
Summary: Convolutional Neural Networks (CNNs) are widely used in vision tasks for resource-constrained environments due to their computational efficiency and strong generalization. However, the dominance of 1×1 convolutions in modern CNN architectures introduces challenges for sparsity-aware hardware accelerators, particularly in processing element (PE) load balancing, which limits speedup in sparse inference. To address this issue, this paper proposes a cross-filter structured pruning method that enforces a uniform sparsity pattern across multiple filters, ensuring balanced workload distribution among PEs. The approach is further extended to k×k convolutions by decomposing them into 1×1 filters, improving applicability across various CNN layers. The paper also proposes an intra-kernel parallelism technique that significantly reduces the size of each PE's local buffers, a critical bottleneck in sparse CNN accelerators. Experimental results show that the proposed approach maintains accuracy comparable to globally unstructured pruning while significantly enhancing inference speed. FPGA implementation and cycle-accurate simulations confirm improvements in processing speed, energy efficiency, and hardware utilization, making the method well suited for edge and mobile AI applications.
Specifically, the proposed architecture achieves a 1.14× to 1.6× speedup over Sparten for various CNN models and delivers 7.6× and 1.9× higher energy efficiency compared to Sparten and StarSPA, respectively. In terms of area efficiency, synthesis results show a 1.73×–10.95× reduction in required hardware primitives compared to Sparten and StarSPA.
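The core idea of cross-filter structured pruning, as summarized above, is that every filter mapped to the same PE group keeps exactly the same set of input channels, so sparse PEs receive identical workloads. The abstract does not give the pruning criterion, so the sketch below is only an illustration under stated assumptions: a magnitude-based importance score aggregated over each group, a hypothetical `group_size` parameter standing in for the number of filters sharing one PE group, and 1×1-convolution weights flattened to a (filters × input-channels) matrix.

```python
import numpy as np

def cross_filter_prune(weights, group_size, sparsity):
    """Illustrative cross-filter structured pruning for 1x1 convolutions.

    Zeroes the same input channels in every filter of a group, so all
    filters in the group share one sparsity pattern (balanced PE work).
    This is a sketch, not the paper's exact method: channel importance
    here is the group-aggregated L1 magnitude, which is an assumption.

    weights:    array of shape (num_filters, in_channels)
    group_size: filters that must share a pattern (hypothetical knob)
    sparsity:   fraction of input channels to zero per group
    """
    out = weights.astype(float).copy()
    num_filters, in_channels = out.shape
    num_pruned = int(round(sparsity * in_channels))
    for start in range(0, num_filters, group_size):
        group = out[start:start + group_size]
        # Aggregate importance over the whole group so the surviving
        # channel set is identical for every filter in the group.
        importance = np.abs(group).sum(axis=0)
        pruned_channels = np.argsort(importance)[:num_pruned]
        group[:, pruned_channels] = 0.0
    return out
```

A quick check of the invariant: after pruning, `(out[i] == 0)` is the same boolean mask for all filters `i` within a group, which is precisely the property that lets a sparsity-aware accelerator assign equal numbers of nonzero multiply-accumulates to each PE.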
ISSN:2169-3536